vLLM V1 fixes logprobs and reaches parity with V0
vLLM V1 matches the behavior of vLLM V0 in an RL experiment after four key fixes: switching to processed_logprobs, aligning V1-specific runtime defaults, matching the inflight weight-update path, and using an fp32 lm_head for the final projection. Before touching the RL objective, the team fixed the backend and only then evaluated objective-level changes.
What happened in a nutshell
Why did this change matter so much? Because in online RL systems, the logprobs coming from the backend are a direct part of the objective function. If those logprobs don't mean the same thing to the trainer and the rollout backend, optimization becomes biased.
In the cited experiment, the reference used vLLM 0.8.5 and the migration tests used vLLM 0.18.1. The first run with V1 showed clear deviations in metrics like clip rate, kl_new_old, entropy, and reward. The team separated the causes into layers and fixed them bottom-up: first semantics, then the inference path, and only then the objective.
Layered diagnosis
They analyzed three classes of possible errors:
Semantic mismatch: the backend returns logprobs with a different meaning than the trainer expects.
Inference-path mismatch: differences in runtime defaults that make the same input follow different execution paths.
Objective mismatch: the objective needs fixes for staleness or async behavior.
At first they were tempted to jump to the third category. The key was to treat the first two as backend problems and rule them out before touching the objective.
Observable symptoms
The mean policy ratio drifted far from 1.0 in the first V1 run.
The clip rate quickly exposed the discrepancy between V0 and the initial V1 run.
The trainer saw logprobs and reward drift away from the reference from the earliest training steps.
These symptoms also show up in other online RL schemes like PPO or GRPO when the rollout uses logprobs that don't match what the trainer expects.
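To make the dependence concrete, here is a minimal sketch (generic PyTorch, not PipelineRL code) of how rollout logprobs enter a PPO/GRPO-style ratio; if the backend's logprobs mean something different from the trainer's, the ratio and the clip rate are biased before any learning happens.

import torch

# The rollout backend's logprobs sit in the denominator of the policy ratio:
# logp_new comes from the trainer's forward pass, logp_rollout from the inference backend.
def ratio_and_clip_rate(logp_new, logp_rollout, eps=0.2):
    ratio = torch.exp(logp_new - logp_rollout)  # should hover around 1.0 right after a weight sync
    clipped = (ratio < 1.0 - eps) | (ratio > 1.0 + eps)
    return ratio.mean().item(), clipped.float().mean().item()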
The four fixes that restored parity
Processing logprobs semantically
vLLM V1 by default delivered logprobs computed from the model's raw outputs, before the post-processing operations (temperature, penalties, top-k/top-p). PipelineRL expected the logprobs actually used by the sampler.
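To illustrate the semantic difference (generic code, not vLLM internals), assuming temperature is the only sampling transform:

import torch.nn.functional as F

# "Raw" logprobs come from the untouched logits; "processed" logprobs are computed after the
# sampling transforms (here only temperature) and match the distribution the sampler drew from.
def raw_and_processed_logprobs(logits, temperature=0.7):
    raw = F.log_softmax(logits, dim=-1)
    processed = F.log_softmax(logits / temperature, dim=-1)
    return raw, processed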
The required setting was:
logprobs-mode=processed_logprobs
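In recent vLLM versions this is exposed as an engine-level argument; a minimal sketch of passing it from Python, assuming the logprobs_mode spelling (check the exact flag name in your version):

from vllm import LLM

# Ask the engine for logprobs computed after temperature/penalties/top-k/top-p,
# i.e. the distribution the sampler actually used. "<policy-model>" is a placeholder.
llm = LLM(model="<policy-model>", logprobs_mode="processed_logprobs")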
This fixed the mean offset in rollout logprobs and centered the policy ratio around 1.0. But differences in clip rate, KL, entropy and downstream behavior remained.
Aligning the inference path and runtime defaults
The first run combined the new engine version with V1's runtime defaults (for example prefix caching and async scheduling). For the parity run those options were made explicit in the config, as sketched below.
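The exact config block isn't reproduced here; a minimal sketch of the intent using vLLM engine arguments (the async-scheduling option name varies across versions, so it is not shown):

from vllm import LLM

# Make the runtime behavior explicit for the parity run:
# no prefix caching, async scheduling left disabled.
llm = LLM(
    model="<policy-model>",
    enable_prefix_caching=False,  # avoid reusing cached state across weight updates
)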
Why disable prefix caching? It's normally a safe optimization when the model weights are fixed, but in online RL the lifetime and reuse of the cache in V1 differed from V0. A cache hit could reuse state computed before a weight update if the cache policy ignored the weight-update boundary. Disabling it removed a V1-only difference.
Matching the inflight weight-update flow
V0 implemented something like: block at an engine boundary, load the new weights, and resume without explicitly invalidating cached state. The team reproduced that flow in V1 with an update path configured as sketched below.
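The exact API call isn't quoted in the write-up; the following is a hypothetical sketch of the flow, where only the mode="keep" and clear_cache=False arguments come from the source and the function names are illustrative:

# Hypothetical sketch of the inflight-update flow; the callables stand in for whatever
# pause/load/resume hooks the engine integration exposes.
def inflight_weight_update(pause, load_weights, resume, new_state_dict):
    pause(mode="keep")            # hold in-flight requests instead of aborting them
    load_weights(new_state_dict)  # swap in the new policy weights
    resume(clear_cache=False)     # keep cached state, mirroring V0's semantics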
The settings mode="keep" and clear_cache=False replicated V0's inflight-update semantics and removed the persistent lag that appeared in the initial V1 run.
Final projection in fp32 for logits
The trainer was already using an fp32 lm_head for the final projection. Small numerical differences in the logits become logprob differences, and those get amplified in policy ratios, KL and clipping. Keeping the lm_head path in fp32 closed the last numerical gap.
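A minimal sketch of the idea (generic PyTorch, not the actual vLLM or trainer code): upcast the final projection so logits, and therefore logprobs, agree across trainer and rollout.

import torch

# Compute the final projection in fp32: bf16/fp16 rounding in this one matmul is enough
# to shift logprobs and, through the policy ratio, bias clipping and KL estimates.
def lm_head_logits_fp32(hidden_states, lm_head_weight):
    return torch.nn.functional.linear(hidden_states.float(), lm_head_weight.float())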
This same type of issue was reported in the MiniMax-M1 write-up and later literature, where computing the head in fp32 fixed training/inference mismatches.
What didn't work and why order matters
Applying only processed_logprobs fixed semantics, but didn't solve all differences. The run still showed lag and clipping discrepancies.
Treating the initial V1 run as a baseline was a mistake: it had several V1 defaults enabled that confused the comparison.
Fixing the objective without first fixing the backend would have mixed two different questions and hidden inference bugs.
The lesson is clear: fix backend correctness first, then add objective-level corrections if staleness or asynchrony still require them.
What's next and best practices for online RL
After restoring inference parity, objective-level improvements are reasonable. Some practical recommendations (a short sketch follows the list):
Keep the behavior policy's logprobs at rollout time.
Recompute old-policy logprobs in the trainer when optimizing, if feasible.
Keep backend mismatch fixes separate from policy-ratio fixes.
If you apply objective fixes before verifying inference equivalence, you risk masking a broken backend and losing important signals about why training changes.
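A compact sketch of the first two recommendations (generic PyTorch, assuming the trainer can re-run the old policy's forward pass):

import torch

# Keep the behavior policy's logprobs from rollout time, but build the ratio from
# trainer-recomputed quantities; the rollout logprobs then serve as a drift diagnostic.
def importance_ratio(trainer_logp_new, trainer_logp_old, rollout_logp_behavior):
    ratio = torch.exp(trainer_logp_new - trainer_logp_old)
    backend_drift = (trainer_logp_old - rollout_logp_behavior).abs().mean()
    return ratio, backend_drift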
Final reflection
This case is a technical and practical reminder: in online RL, inference accuracy is not a minor detail. Small semantic or numerical mismatches in logprobs, in cache handling, and in inflight weight updates translate into big differences in behavior. The takeaway? Check backend parity first, document the runtime defaults, and only then evaluate objective changes.