vLLM V0 to V1: Correctness Before Corrections in RL
By Jakub Antkiewicz
2026-05-07
ServiceNow-AI Details vLLM Migration, Highlighting Critical RL Inference Bugs
Engineers at ServiceNow-AI have published a detailed account of their migration from vLLM V0 to V1 within their PipelineRL framework, revealing a series of subtle but critical 'train-inference mismatches' that initially derailed their training process. The case study matters for any organization using online Reinforcement Learning (RL), as it demonstrates how seemingly minor discrepancies in an inference engine's backend—specifically how it computes token log probabilities—can significantly alter training dynamics and invalidate results. The team's methodical approach prioritized establishing backend correctness before attempting to apply any algorithmic corrections to the RL objective itself.
Isolating the Discrepancies
The initial migration attempt from vLLM 0.8.5 to 0.18.1 resulted in training metrics like KL divergence, clip rate, and reward diverging sharply from the established V0 baseline. Instead of treating this as an RL objective problem, the team correctly diagnosed it as a backend behavior issue. They isolated and fixed four distinct problems to restore parity between the trainer's expectations and the rollout generator's output:
- Logprob Semantics: Configuring vLLM V1 to return log probabilities from the processed distribution used by the sampler (processed_logprobs), not from raw, pre-sampling model outputs (see the configuration sketch after this list).
- Runtime Defaults: Explicitly disabling V1-only defaults such as prefix caching and async scheduling to precisely match the execution path of the V0 reference run.
- Weight Update Path: Aligning the inflight weight update mechanism in V1 with the V0 behavior, which involved resuming generation without clearing cached state after loading new weights (sketched below).
- Numerical Precision: Forcing the final logit projection (lm_head) to run in fp32 precision on the inference backend to match the trainer's numerical path, a subtle issue also noted in other large-scale RL research (see the precision sketch below).
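The first two fixes are plausibly expressible as engine configuration. The sketch below assumes a recent vLLM V1 build that exposes logprobs_mode, enable_prefix_caching, and async_scheduling as engine arguments; exact argument names and defaults vary between vLLM releases, and the model path is a placeholder, so treat this as an illustration rather than the team's actual setup.

```python
from vllm import LLM, SamplingParams

# Sketch: engine arguments assumed to exist in a recent vLLM V1 release;
# names and availability differ across versions.
llm = LLM(
    model="my-policy-checkpoint",        # placeholder model path
    logprobs_mode="processed_logprobs",  # logprobs from the sampler's processed
                                         # distribution, not raw pre-sampling outputs
    enable_prefix_caching=False,         # match the V0 reference run's execution path
    async_scheduling=False,              # disable V1-only async scheduling (if exposed)
)

params = SamplingParams(
    temperature=1.0,
    max_tokens=512,
    logprobs=0,  # return the logprob of each sampled token
)

outputs = llm.generate(["<prompt used for a rollout>"], params)
for out in outputs:
    for token_logprobs in out.outputs[0].logprobs:
        # each entry maps token id -> logprob info for the sampled token
        print(token_logprobs)
```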
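The inflight weight update fix is specific to PipelineRL's orchestration and is not spelled out in code in the write-up. The sketch below only illustrates the stated behavior (load new weights, then resume generation without clearing cached state); pause_generation, load_new_weights, and resume_generation are hypothetical placeholders, not vLLM or PipelineRL APIs.

```python
# Hypothetical orchestration illustrating the described V0-style behavior:
# weights are swapped in flight and generation resumes against existing caches.

def inflight_weight_update(engine, new_checkpoint_path: str) -> None:
    pause_generation(engine)                        # stop scheduling new decode steps
    load_new_weights(engine, new_checkpoint_path)   # swap trainer weights into the engine
    # Deliberately do NOT reset the KV cache or prefix cache here:
    # the V0-matching behavior resumes generation with cached state intact.
    resume_generation(engine)
```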
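For the precision fix there is no single universal flag; as an illustration only, the PyTorch snippet below shows the general idea of forcing the final projection into fp32 while the rest of the model stays in its lower-precision dtype. The FP32LMHead wrapper and the model it is attached to are assumptions for the example; the team's actual mechanism inside the inference backend may differ.

```python
import torch
import torch.nn as nn


class FP32LMHead(nn.Module):
    """Wraps an existing lm_head so the final logit projection runs in fp32,
    matching the trainer's numerical path (illustrative only)."""

    def __init__(self, lm_head: nn.Linear):
        super().__init__()
        # Keep an fp32 copy of the projection weights.
        self.weight = nn.Parameter(lm_head.weight.detach().float(), requires_grad=False)
        self.bias = (
            nn.Parameter(lm_head.bias.detach().float(), requires_grad=False)
            if lm_head.bias is not None
            else None
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Upcast activations before the projection so logits (and hence logprobs)
        # are computed in fp32 rather than bf16/fp16.
        return nn.functional.linear(hidden_states.float(), self.weight, self.bias)


# Usage sketch: `model` is a hypothetical HF-style causal LM with an `lm_head` attribute.
# model.lm_head = FP32LMHead(model.lm_head)
```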
The broader lesson from ServiceNow-AI's experience is a call for discipline in MLOps for complex AI systems. By fixing the underlying inference engine behavior first, they avoided a common pitfall: masking a fundamental correctness bug with a higher-level algorithmic patch. This approach ensures that training dynamics are interpretable and that any subsequent objective-side corrections for issues like policy staleness are built upon a foundation of reliable, correct backend behavior. This principle is essential as more enterprises deploy online RL systems where the feedback loop between inference and training is immediate and sensitive.
The ServiceNow-AI team's experience with the vLLM V1 migration underscores a critical MLOps principle for online RL systems: seemingly minor backend discrepancies in logprob computation, caching, and numerical precision are not just performance issues but correctness bugs that must be resolved before applying higher-level algorithmic corrections.