AiPhreaks

Run High-Throughput Reinforcement Learning Training with End-to-End FP8 Precision

By Jakub Antkiewicz

2026-04-21T09:26:23Z

NVIDIA Details End-to-End FP8 Method to Speed Up RL Training

NVIDIA researchers have detailed a new method using end-to-end FP8 precision within the open-source NVIDIA NeMo RL framework to accelerate the compute-intensive workloads of Reinforcement Learning (RL). This development addresses a significant bottleneck in advancing Large Language Models (LLMs) from simple text generation to complex reasoning, a transition heavily reliant on RL algorithms like Group Relative Policy Optimization (GRPO) for iterative improvement.
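The group-relative idea behind GRPO can be sketched in a few lines: each prompt is sampled several times, and each rollout's reward is normalized against its own group's statistics, so no separate value (critic) network is needed. This is a minimal illustration of the advantage computation, not NeMo RL's implementation; the function name and epsilon are assumptions.

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards):
    """Group-relative advantages (GRPO): normalize each rollout's reward
    by the mean and stdev of its sampling group, so relative quality
    within the group drives the policy update."""
    m = mean(group_rewards)
    s = stdev(group_rewards)
    return [(r - m) / (s + 1e-6) for r in group_rewards]

# Example: 4 rollouts sampled for one prompt, scored by a reward model
advantages = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Rollouts above the group mean receive positive advantages and are reinforced; those below are discouraged, which is what drives the iterative improvement loop the article describes.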

The technique applies a block-wise quantized FP8 recipe to a model's linear layers during both the latency-sensitive generation phase, often handled by engines like vLLM, and the high-throughput training phase within frameworks like Megatron Core. This end-to-end consistency is critical for reducing numerical disagreement that can arise in low-precision, multi-engine pipelines. Key technical aspects include:

  • Precision Method: End-to-end FP8 (E4M3) for linear layers in both generation and training.
  • Numerical Alignment: Using FP8 consistently across the RL loop reduces the multiplicative token-probability error that accumulates between the generation and training engines, compared to mixed-precision approaches.
  • Accuracy Parity: Importance sampling is employed to correct for distributional mismatches, enabling the FP8 method to achieve validation accuracy on par with the standard BF16 baseline.
  • Extended Optimization: The technique also extends FP8 to the KV cache and attention mechanisms, using a dynamic recalibration process to handle constantly updating policy weights.
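The block-wise quantization mentioned above can be illustrated with a small simulation: each block of values gets its own scale factor mapping it into E4M3's representable range, so a single outlier cannot wash out precision for the whole tensor. The block size of 128 and helper names here are illustrative assumptions, not NeMo RL's exact recipe.

```python
# Simulated block-wise FP8 (E4M3) scaling. E4M3's largest finite
# magnitude is 448; a per-block scale maps each block into that range.
E4M3_MAX = 448.0

def quantize_blockwise(values, block_size=128):
    """Return (scaled_blocks, scales): each block is divided by a
    per-block scale so its values fit the E4M3 range.
    Dequantization is simply q * scale."""
    blocks, scales = [], []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        amax = max(abs(v) for v in block) or 1.0  # avoid div-by-zero
        scale = amax / E4M3_MAX
        blocks.append([v / scale for v in block])
        scales.append(scale)
    return blocks, scales

weights = [0.01 * k for k in range(256)]
qblocks, scales = quantize_blockwise(weights)
```

Real FP8 kernels additionally round each scaled value to the nearest E4M3 code point; this sketch keeps only the scaling step, which is where the block-wise structure lives.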

In performance evaluations on a Llama 3.1 8B Instruct model, the end-to-end FP8 recipe demonstrated a training throughput improvement of over 15% while matching the validation accuracy of the BF16 baseline. Extending FP8 to the KV cache and attention layers provided an additional ~30% speedup in the rollout stage. For Mixture-of-Experts (MoE) models like Qwen3-30B, the FP8 approach likewise achieved accuracy curves comparable to the BF16 baseline. These efficiency gains make the computationally demanding process of RL training more operationally viable, which could accelerate the development cycles for more capable agentic AI systems.
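The importance-sampling correction that underpins the accuracy-parity result can be sketched at the token level: when the FP8 generation engine and the training engine assign slightly different log-probabilities to the same sampled tokens, reweighting by the likelihood ratio keeps the gradient estimate consistent with the training policy. The function name and the clip bound of 2.0 are illustrative assumptions, not NeMo RL's exact values.

```python
import math

def is_weights(train_logprobs, gen_logprobs, clip=2.0):
    """Per-token importance weights p_train / p_gen, clipped for
    stability. Small engine disagreement yields weights near 1.0."""
    return [min(math.exp(lt - lg), clip)
            for lt, lg in zip(train_logprobs, gen_logprobs)]

# Slight numerical drift between generation and training engines
w = is_weights([-1.00, -2.30, -0.50], [-1.02, -2.28, -0.50])
```

Because end-to-end FP8 keeps the two engines numerically aligned, these ratios stay close to 1.0, so the correction changes the update only marginally while removing the residual bias.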

By standardizing on a lower-precision format across the entire Reinforcement Learning pipeline, NVIDIA is addressing the practical compute-cost and throughput barriers to scaling model reasoning, shifting focus from theoretical performance to deployable, iterative improvement.