TL;DR
Hugging Face reports that vLLM V1 now matches vLLM V0 in backend behavior after four fixes: rollout logprobs processing, runtime defaults, inflight weight updates, and an fp32 lm_head. This restores accurate RL training metrics before any changes to the objective.
The alignment covers reinforcement learning workflows specifically, and the team resolved all four backend issues before making any changes to the RL objective.
The team identified and fixed four primary issues: how rollout logprobs were processed, runtime default settings, inflight weight update handling, and the precision of the final projection, which now uses an fp32 lm_head. These fixes eliminated discrepancies in clip rate, KL divergence, entropy, and reward, all of which had diverged in the initial V1 tests.
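To make two of these metrics concrete, here is a minimal sketch of how a clip rate and a simple KL estimate can be computed from per-token log-probabilities. The function name, the clip_eps value of 0.2, and the mean-log-ratio estimator are illustrative assumptions, not the team's actual implementation.

```python
import math


def rollout_diagnostics(trainer_logprobs, rollout_logprobs, clip_eps=0.2):
    """Compare per-token logprobs from the trainer and the rollout engine.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    if len(trainer_logprobs) != len(rollout_logprobs):
        raise ValueError("logprob sequences must have the same length")
    if not trainer_logprobs:
        raise ValueError("at least one token is required")

    n = len(trainer_logprobs)
    pairs = list(zip(trainer_logprobs, rollout_logprobs))

    # Importance ratio pi_trainer / pi_rollout for each sampled token.
    ratios = [math.exp(t - r) for t, r in pairs]

    # Fraction of tokens whose ratio falls outside [1 - eps, 1 + eps].
    clipped = sum(1 for x in ratios if x < 1.0 - clip_eps or x > 1.0 + clip_eps)

    # Sample-based estimate of KL(rollout || trainer): mean of log q - log p
    # over tokens sampled from the rollout policy q. It can be negative on
    # small samples because it is an estimate, not the exact divergence.
    kl_estimate = sum(r - t for t, r in pairs) / n

    return {"clip_rate": clipped / n, "kl_estimate": kl_estimate}


matched = rollout_diagnostics([-1.0, -2.0], [-1.0, -2.0])
mismatched = rollout_diagnostics([-1.0, -2.0], [-1.5, -2.0])
```

If the rollout and trainer log-probabilities agree, both values are zero. A backend mismatch shows up as a nonzero clip rate and KL estimate even when the policy weights are identical.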
Initially, vLLM V1 returned raw model output logprobs, which did not match the processed distribution the trainer expected, and training metrics deviated as a result. Setting logprobs-mode=processed_logprobs aligned the returned logprobs with that expectation. The team also set runtime defaults explicitly, disabling prefix caching and async scheduling, two features that had caused inconsistencies. Finally, they refined weight update synchronization to mirror V0 behavior, which avoids reusing stale cache entries during online RL updates.
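The gap between raw and processed logprobs can be illustrated with a small sketch. It assumes temperature scaling is the only processing step applied before sampling; a real engine may apply further steps, and the logits, temperature, and function names here are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def raw_logprobs(logits):
    # Log-probabilities of the unmodified model output.
    return log_softmax(logits)


def processed_logprobs(logits, temperature):
    # Log-probabilities of the distribution actually sampled from,
    # assuming temperature is the only processing step (an assumption).
    return log_softmax([x / temperature for x in logits])


logits = [2.0, 1.0, 0.0]   # hypothetical logits for a 3-token vocabulary
temperature = 0.7          # hypothetical sampling temperature
token = 0                  # index of the sampled token

raw = raw_logprobs(logits)[token]
processed = processed_logprobs(logits, temperature)[token]

# A trainer that scores tokens under the processed distribution sees a ratio
# different from 1 if the engine reports raw logprobs instead.
spurious_ratio = math.exp(processed - raw)
print(f"raw={raw:.4f} processed={processed:.4f} ratio={spurious_ratio:.4f}")
```

At a temperature of 1.0 the two functions agree. At any other temperature the ratio departs from 1 even though nothing about the policy has changed, which is the kind of mismatch that can distort clip rate and KL.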
Why It Matters
Reinforcement learning training depends on accurate logprobs and consistent inference behavior, so correct backend behavior is critical to reliable policy updates. These fixes prevent training instability caused by mismatched metrics, and settling correctness before modifying the objective gives future model improvements a stable foundation.
Background
The move from vLLM V0 to V1 was a major rewrite aimed at improving inference performance and flexibility. However, early tests revealed discrepancies in training metrics, particularly in logprobs and reward signals, which are sensitive to backend behavior. The initial V1 attempt showed divergence in key metrics such as clip rate and entropy, prompting a targeted investigation into potential causes, including semantic, inference-path, and objective mismatches. The team prioritized fixing the first two before addressing the objective, resulting in a stable baseline for subsequent modifications.
“We have aligned vLLM V1’s backend behavior with V0 by fixing logprobs processing, runtime defaults, and weight update handling, ensuring accurate RL training metrics before any objective changes.”
— Hugging Face team
“Correctness in the inference engine is essential for reliable reinforcement learning, and these fixes establish a solid foundation for future improvements.”
— Hugging Face developer

What Remains Unclear
It is still unclear how these backend fixes will influence subsequent modifications to the RL objective or whether similar issues could arise with future updates. The long-term impact on training stability remains to be observed as further objective-level changes are implemented.

What’s Next
Next steps involve evaluating the impact of these backend fixes on actual RL training objectives, testing the stability of policy updates, and proceeding with planned modifications to the RL loss functions. Monitoring for any residual discrepancies or new issues will be critical as development continues.
Key Questions
What specific issues were fixed in vLLM V1?
The team fixed four issues: the logprobs processing mode, runtime default settings (prefix caching and async scheduling), inflight weight update handling, and the precision of the final projection, which now uses an fp32 lm_head.
Why was it important to fix backend behavior before changing the RL objective?
Discrepancies in logprobs and inference behavior distort training metrics, which directly affect policy updates and model performance. With the backend verified first, later changes in those metrics can be attributed to the objective rather than to the inference engine.
Will these fixes affect future model training or only the current setup?
The fixes establish a stable baseline for current training and are expected to improve the reliability of future RL training, though ongoing testing will determine if further adjustments are needed.
Are there still unresolved issues after these fixes?
While the core backend issues are addressed, it remains to be seen how these changes impact the broader training pipeline, and whether additional issues may emerge with further modifications to the RL objectives.