Does training with the 4-bit NVFP4 format significantly reduce the final model's accuracy compared to FP8?

No. According to NVIDIA's benchmarks on a Llama 3 8B model, the NVFP4 training recipe tracks the FP8 baseline loss curve almost perfectly over 10,000 steps. The company reports 'no measurable accuracy cost,' indicating the techniques used successfully preserve convergence while increasing training throughput.

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

NVIDIA Details NVFP4 Recipe for Faster LLM Training on Blackwell

NVIDIA has detailed a 4-bit training recipe built on its NVFP4 format that speeds up large language model pre-training on its Blackwell and upcoming Rubin hardware platforms. Implemented in JAX through the MaxText library and TransformerEngine, the technique delivers up to 1.73x the throughput of established FP8 baselines. It targets the high cost and long timelines of training frontier AI models, and NVIDIA reports no measurable penalty to final model accuracy.

The core of the NVFP4 recipe is a combination of five specialized techniques designed to maintain numerical stability and convergence at sub-byte precision. This approach avoids the common pitfalls of low-bit training by strategically applying quantization only to the MLP layers of a transformer, where most compute occurs, while leaving sensitive attention blocks at higher precision. The key components include:

16-element micro block scaling for finer-grained outlier management.
E4M3 block scale factors for more expressive scaling than power-of-two methods.
A selective Random Hadamard Transform (RHT) applied only to weight-gradient inputs to normalize outliers.
2D weight scaling to ensure consistency between forward and backward passes.
Stochastic rounding to prevent small gradient updates from being lost during quantization.
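
Two of the components listed above, 16-element micro-block scaling and stochastic rounding, can be illustrated with a minimal pure-Python sketch. This is not NVIDIA's implementation and does not use the TransformerEngine or MaxText APIs. The function names are hypothetical, the per-block scale is kept as an ordinary Python float rather than being stored in E4M3, and the Random Hadamard Transform and 2D weight scaling are left out. The value grid is the set of magnitudes a 4-bit E2M1 float can represent.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # micro-block size named in the recipe


def stochastic_round_to_grid(x, rng):
    """Round x to one of its two neighbouring grid points.

    The upper neighbour is chosen with probability proportional to how close
    x is to it, so the rounded value equals x on average.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])  # clamp to the largest magnitude
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)
            return sign * (hi if rng.random() < p_up else lo)
    return sign * E2M1_GRID[-1]  # unreachable after clamping; kept for safety


def quantize_block(block, rng):
    """Quantize one block; returns (scale, list of grid values)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Simplification: a real NVFP4 block scale would itself be stored in E4M3.
    scale = amax / E2M1_GRID[-1]  # maps the block's largest value to 6.0
    return scale, [stochastic_round_to_grid(v / scale, rng) for v in block]


def quantize(values, rng):
    """Split values into 16-element blocks and quantize each independently."""
    blocks = [values[i:i + BLOCK_SIZE] for i in range(0, len(values), BLOCK_SIZE)]
    return [quantize_block(b, rng) for b in blocks]


def dequantize(quantized):
    """Multiply each stored grid value by its block's scale."""
    return [scale * q for scale, qs in quantized for q in qs]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(64)]
    data[5] = 40.0  # an outlier inflates the scale of its own block only
    restored = dequantize(quantize(data, rng))
    for start in range(0, len(data), BLOCK_SIZE):
        err = max(abs(a - b) for a, b in
                  zip(data[start:start + BLOCK_SIZE], restored[start:start + BLOCK_SIZE]))
        print(f"block starting at {start}: max abs error {err:.3f}")
```

Because each block carries its own scale, the outlier placed in the first block coarsens the quantization of only that block's 16 values, which is the motivation for small blocks. The stochastic rounding step keeps the rounding unbiased on average, so small values are not systematically rounded away.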

NVIDIA's benchmarks show the practical benefit. When pre-training a Llama 3.1 405B model on a GB300 Grace Blackwell Ultra Superchip, the NVFP4 recipe achieved a 1.73x speedup over an FP8 baseline. For AI labs and enterprises, that acceleration means lower training costs and a faster path to deploying new models. The release also reinforces NVIDIA's strategy of pairing hardware advances with tightly integrated software, producing an optimized stack that is difficult for competitors to replicate.
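
To put the 1.73x figure in concrete terms, the short calculation below converts a throughput speedup into relative wall-clock time. It assumes the speedup holds for the whole run and that the same number of training steps is needed, which is consistent with the reported loss curves but is an assumption here. The 30-day baseline is a hypothetical number chosen for illustration, not a figure from NVIDIA.

```python
def relative_training_time(speedup: float) -> float:
    """Fraction of baseline wall-clock time needed at a given throughput speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


SPEEDUP = 1.73        # best-case figure quoted in the article
BASELINE_DAYS = 30.0  # hypothetical FP8 run length, for illustration only

fraction = relative_training_time(SPEEDUP)
print(f"time relative to FP8: {fraction:.1%}")
print(f"time saved:           {1 - fraction:.1%}")
print(f"hypothetical run:     {BASELINE_DAYS * fraction:.1f} days instead of {BASELINE_DAYS:.0f}")
```

Under these assumptions the NVFP4 run takes a little under 58% of the FP8 wall-clock time. Since 1.73x is quoted as an upper bound, real savings on other models or hardware may be smaller.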

The release of the NVFP4 training recipe shows that NVIDIA's competitive advantage lies not only in its silicon but in its vertically integrated ecosystem. By shipping tuned software for frameworks like JAX, the company makes its hardware's advanced features usable immediately, which maximizes performance and strengthens developer loyalty.
