vLLM V0 to V1: Correctness Before Corrections in RL
By Jakub Antkiewicz
2026-05-07
ServiceNow-AI Details vLLM Migration, Highlighting Critical RL Inference Bugs
Engineers at ServiceNow-AI have published a detailed account of their migration from vLLM V0 to V1 within their PipelineRL framework, revealing a series of subtle but critical 'train-inference mismatches' that initially derailed their training process. The case study matters for any organization using online Reinforcement Learning (RL), as it demonstrates how seemingly minor discrepancies in an inference engine's backend—specifically how it computes token log probabilities—can significantly alter training dynamics and invalidate results. The team's methodical approach prioritized establishing backend correctness before attempting to apply any algorithmic corrections to the RL objective itself.
Isolating the Discrepancies
The initial migration attempt from vLLM 0.8.5 to 0.18.1 resulted in training metrics like KL divergence, clip rate, and reward diverging sharply from the established V0 baseline. Instead of treating this as an RL objective problem, the team correctly diagnosed it as a backend behavior issue. They isolated and fixed four distinct problems to restore parity between the trainer's expectations and the rollout generator's output:
- Logprob Semantics: Configuring vLLM V1 to return log probabilities from the processed distribution used by the sampler (processed_logprobs), not from raw, pre-sampling model outputs (see the configuration sketch after this list).
- Runtime Defaults: Explicitly disabling V1-only defaults such as prefix caching and async scheduling to precisely match the execution path of the V0 reference run.
- Weight Update Path: Aligning the inflight weight update mechanism in V1 with the V0 behavior, which involved resuming generation without clearing cached state after loading new weights (sketched below).
- Numerical Precision: Forcing the final logit projection (lm_head) to run in fp32 precision on the inference backend to match the trainer's numerical path, a subtle issue also noted in other large-scale RL research (see the precision sketch below).
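The first two fixes are plausibly expressible as engine configuration. The sketch below assumes a recent vLLM V1 build that exposes logprobs_mode, enable_prefix_caching, and async_scheduling as engine arguments; exact argument names and defaults vary between vLLM releases, and the model path is a placeholder, so treat this as an illustration rather than the team's actual setup.

```python
from vllm import LLM, SamplingParams

# Sketch: engine arguments assumed to exist in a recent vLLM V1 release;
# names and availability differ across versions.
llm = LLM(
    model="my-policy-checkpoint",        # placeholder model path
    logprobs_mode="processed_logprobs",  # logprobs from the sampler's processed
                                         # distribution, not raw pre-sampling outputs
    enable_prefix_caching=False,         # match the V0 reference run's execution path
    async_scheduling=False,              # disable V1-only async scheduling (if exposed)
)

params = SamplingParams(
    temperature=1.0,
    max_tokens=512,
    logprobs=0,  # return the logprob of each sampled token
)

outputs = llm.generate(["<prompt used for a rollout>"], params)
for out in outputs:
    for token_logprobs in out.outputs[0].logprobs:
        # each entry maps token id -> logprob info for the sampled token
        print(token_logprobs)
```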
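The inflight weight update fix is specific to PipelineRL's orchestration and is not spelled out in code in the write-up. The sketch below only illustrates the stated behavior (load new weights, then resume generation without clearing cached state); pause_generation, load_new_weights, and resume_generation are hypothetical placeholders, not vLLM or PipelineRL APIs.

```python
# Hypothetical orchestration illustrating the described V0-style behavior:
# weights are swapped in flight and generation resumes against existing caches.

def inflight_weight_update(engine, new_checkpoint_path: str) -> None:
    pause_generation(engine)                        # stop scheduling new decode steps
    load_new_weights(engine, new_checkpoint_path)   # swap trainer weights into the engine
    # Deliberately do NOT reset the KV cache or prefix cache here:
    # the V0-matching behavior resumes generation with cached state intact.
    resume_generation(engine)
```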
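For the precision fix there is no single universal flag; as an illustration only, the PyTorch snippet below shows the general idea of forcing the final projection into fp32 while the rest of the model stays in its lower-precision dtype. The FP32LMHead wrapper and the model it is attached to are assumptions for the example; the team's actual mechanism inside the inference backend may differ.

```python
import torch
import torch.nn as nn


class FP32LMHead(nn.Module):
    """Wraps an existing lm_head so the final logit projection runs in fp32,
    matching the trainer's numerical path (illustrative only)."""

    def __init__(self, lm_head: nn.Linear):
        super().__init__()
        # Keep an fp32 copy of the projection weights.
        self.weight = nn.Parameter(lm_head.weight.detach().float(), requires_grad=False)
        self.bias = (
            nn.Parameter(lm_head.bias.detach().float(), requires_grad=False)
            if lm_head.bias is not None
            else None
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Upcast activations before the projection so logits (and hence logprobs)
        # are computed in fp32 rather than bf16/fp16.
        return nn.functional.linear(hidden_states.float(), self.weight, self.bias)


# Usage sketch: `model` is a hypothetical HF-style causal LM with an `lm_head` attribute.
# model.lm_head = FP32LMHead(model.lm_head)
```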
The broader lesson from ServiceNow-AI's experience is a call for discipline in MLOps for complex AI systems. By fixing the underlying inference engine behavior first, they avoided a common pitfall: masking a fundamental correctness bug with a higher-level algorithmic patch. This approach ensures that training dynamics are interpretable and that any subsequent objective-side corrections for issues like policy staleness are built upon a foundation of reliable, correct backend behavior. This principle is essential as more enterprises deploy online RL systems where the feedback loop between inference and training is immediate and sensitive.
The ServiceNow-AI team's experience with the vLLM V1 migration underscores a critical MLOps principle for online RL systems: seemingly minor backend discrepancies in logprob computation, caching, and numerical precision are not just performance issues but correctness bugs that must be resolved before applying higher-level algorithmic corrections.