vLLM V1 fixes logprobs and reaches parity with V0
vLLM V1 matches the behavior of vLLM V0 in an RL experiment after four key fixes: switching to processed_logprobs, aligning V1-specific runtime defaults, matching the inflight weight-update path, and using an fp32 lm_head for the final projection. Before touching the RL objective, the team fixed the backend and only then evaluated objective-level changes.
What happened in a nutshell
Why did this change matter so much? Because in online RL systems, the logprobs coming from the backend are a direct part of the objective function. If those logprobs don't mean the same thing to the trainer and the rollout backend, optimization becomes biased.
In the cited experiment, the reference used vLLM 0.8.5 and the migration tests used vLLM 0.18.1. The first run with V1 showed clear deviations in metrics like clip rate, kl_new_old, entropy, and reward. The team separated the causes into layers and fixed them bottom-up: first semantics, then the inference path, and only then the objective.
Layered diagnosis
They analyzed three classes of possible errors:
Semantic mismatch: the backend returns logprobs with a different meaning than the trainer expects.
Inference-path mismatch: differences in runtime defaults that make the same input follow different execution paths.
Objective mismatch: the objective needs fixes for staleness or async behavior.
At first they were tempted to jump to the third category. The key was to treat the first two as backend problems and rule them out before touching the objective.
Observable symptoms
The mean policy ratio drifted far from 1.0 in the first V1 run.
The clip rate quickly exposed the discrepancy between V0 and the initial V1 run.
The trainer saw logprobs and reward drift away from the reference from the earliest training steps.
These symptoms also show up in other online RL schemes like PPO or GRPO when the rollout uses logprobs that don't match what the trainer expects.
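To make the dependence concrete, here is a minimal sketch (generic PyTorch, not PipelineRL code) of how rollout logprobs enter a PPO/GRPO-style ratio; if the backend's logprobs mean something different from the trainer's, the ratio and the clip rate are biased before any learning happens.

import torch

# The rollout backend's logprobs sit in the denominator of the policy ratio:
# logp_new comes from the trainer's forward pass, logp_rollout from the inference backend.
def ratio_and_clip_rate(logp_new, logp_rollout, eps=0.2):
    ratio = torch.exp(logp_new - logp_rollout)  # should hover around 1.0 right after a weight sync
    clipped = (ratio < 1.0 - eps) | (ratio > 1.0 + eps)
    return ratio.mean().item(), clipped.float().mean().item()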
The four fixes that restored parity
Processing logprobs semantically
vLLM V1 by default delivered logprobs computed from the model's raw outputs, before the post-processing operations (temperature, penalties, top-k/top-p). PipelineRL expected the logprobs actually used by the sampler.
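To illustrate the semantic difference (generic code, not vLLM internals), assuming temperature is the only sampling transform:

import torch.nn.functional as F

# "Raw" logprobs come from the untouched logits; "processed" logprobs are computed after the
# sampling transforms (here only temperature) and match the distribution the sampler drew from.
def raw_and_processed_logprobs(logits, temperature=0.7):
    raw = F.log_softmax(logits, dim=-1)
    processed = F.log_softmax(logits / temperature, dim=-1)
    return raw, processed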
The required setting was:
logprobs-mode=processed_logprobs
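In recent vLLM versions this is exposed as an engine-level argument; a minimal sketch of passing it from Python, assuming the logprobs_mode spelling (check the exact flag name in your version):

from vllm import LLM

# Ask the engine for logprobs computed after temperature/penalties/top-k/top-p,
# i.e. the distribution the sampler actually used. "<policy-model>" is a placeholder.
llm = LLM(model="<policy-model>", logprobs_mode="processed_logprobs")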
This fixed the mean offset in rollout logprobs and centered the policy ratio around 1.0. But differences in clip rate, KL, entropy and downstream behavior remained.
Aligning the inference path and runtime defaults
The first run combined the new engine version with V1's runtime defaults (for example prefix caching and async scheduling). For the parity run those options were made explicit in the config, as sketched below.
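The exact config block isn't reproduced here; a minimal sketch of the intent using vLLM engine arguments (the async-scheduling option name varies across versions, so it is not shown):

from vllm import LLM

# Make the runtime behavior explicit for the parity run:
# no prefix caching, async scheduling left disabled.
llm = LLM(
    model="<policy-model>",
    enable_prefix_caching=False,  # avoid reusing cached state across weight updates
)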
Why disable prefix caching? It's normally a safe optimization when the model weights are fixed, but in online RL the lifetime and reuse of the cache in V1 differed from V0. A cache hit could reuse state computed before a weight update if the cache policy ignored the weight-update boundary. Disabling it removed a V1-only difference.
Matching the inflight weight-update flow
V0 implemented something like: block at an engine boundary, load the new weights, and resume without explicitly invalidating cached state. The team reproduced that flow in V1 with an update path configured as sketched below.
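The exact API call isn't quoted in the write-up; the following is a hypothetical sketch of the flow, where only the mode="keep" and clear_cache=False arguments come from the source and the function names are illustrative:

# Hypothetical sketch of the inflight-update flow; the callables stand in for whatever
# pause/load/resume hooks the engine integration exposes.
def inflight_weight_update(pause, load_weights, resume, new_state_dict):
    pause(mode="keep")            # hold in-flight requests instead of aborting them
    load_weights(new_state_dict)  # swap in the new policy weights
    resume(clear_cache=False)     # keep cached state, mirroring V0's semantics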
The settings mode="keep" and clear_cache=False replicated V0's inflight-update semantics and removed the persistent lag that appeared in the initial V1 run.
Final projection in fp32 for logits
The trainer was already using an fp32 lm_head for the final projection. Small numerical differences in the logits become logprob differences, and those get amplified in policy ratios, KL and clipping. Keeping the lm_head path in fp32 closed the last numerical gap.
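A minimal sketch of the idea (generic PyTorch, not the actual vLLM or trainer code): upcast the final projection so logits, and therefore logprobs, agree across trainer and rollout.

import torch

# Compute the final projection in fp32: bf16/fp16 rounding in this one matmul is enough
# to shift logprobs and, through the policy ratio, bias clipping and KL estimates.
def lm_head_logits_fp32(hidden_states, lm_head_weight):
    return torch.nn.functional.linear(hidden_states.float(), lm_head_weight.float())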
This same type of issue was reported in the MiniMax-M1 write-up and later literature, where computing the head in fp32 fixed training/inference mismatches.
What didn't work and why order matters
Applying only processed_logprobs fixed semantics, but didn't solve all differences. The run still showed lag and clipping discrepancies.
Treating the initial V1 run as a baseline was a mistake: it had several V1 defaults enabled that confused the comparison.
Fixing the objective without first fixing the backend would have mixed two different questions and hidden inference bugs.
The lesson is clear: fix backend correctness first, then add objective-level corrections if staleness or asynchrony still require them.
What's next and best practices for online RL
After restoring inference parity, objective-level improvements are reasonable. Some practical recommendations (a short sketch follows the list):
Keep the behavior policy's logprobs at rollout time.
Recompute old-policy logprobs in the trainer when optimizing, if feasible.
Keep backend mismatch fixes separate from policy-ratio fixes.
If you apply objective fixes before verifying inference equivalence, you risk masking a broken backend and losing important signals about why training changes.
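A compact sketch of the first two recommendations (generic PyTorch, assuming the trainer can re-run the old policy's forward pass):

import torch

# Keep the behavior policy's logprobs from rollout time, but build the ratio from
# trainer-recomputed quantities; the rollout logprobs then serve as a drift diagnostic.
def importance_ratio(trainer_logp_new, trainer_logp_old, rollout_logp_behavior):
    ratio = torch.exp(trainer_logp_new - trainer_logp_old)
    backend_drift = (trainer_logp_old - rollout_logp_behavior).abs().mean()
    return ratio, backend_drift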
Final reflection
This case is a technical and practical reminder: in online RL, inference accuracy is not a minor detail. Small semantic or numerical mismatches in logprobs, in cache handling, and in inflight weight updates translate into big differences in behavior. The takeaway? Check backend parity first, document the runtime defaults, and only then evaluate objective changes.