What is 'four-over-six' scaling and why was it important for the Nemotron 3 Ultra model?

Four-over-six scaling is a quantization technique that addresses a specific weakness in the NVFP4 data format: no representable value lies between 4 and 6, which is the widest gap on its grid. For each block of model weights, the method scales the values to a maximum of either 4 or 6, selecting whichever option gives the lower reconstruction error. This was critical for Nemotron 3 Ultra because it cut the median reconstruction error by 16.4% compared to standard methods and delivered the best accuracy on downstream tasks for the model's MoE expert layers.
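
To make the gap concrete, here is a minimal sketch of the magnitudes a 4-bit E2M1 float (the element type NVFP4 uses) can represent, and what happens when values between 4 and 6 are rounded. The names `E2M1_GRID` and `round_to_grid` are illustrative and do not come from any NVIDIA library.

```python
import numpy as np

# Non-negative magnitudes representable by a 4-bit E2M1 float.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_grid(x):
    """Round each value to the nearest representable magnitude, keeping its sign."""
    x = np.asarray(x, dtype=np.float64)
    idx = np.abs(np.abs(x)[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

# Spacing between neighbouring grid points: 0.5 near zero, 2.0 between 4 and 6.
gaps = np.diff(E2M1_GRID)

print(gaps)
print(round_to_grid([4.9, 5.2]))  # 4.9 snaps down to 4.0, 5.2 snaps up to 6.0
```

Anything that lands between 4 and 6 after scaling can be off by almost 1.0, which is the error that choosing between the two scaling targets is meant to reduce.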

Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer

NVIDIA Details High-Performance NVFP4 Quantization for Nemotron 3 Ultra

NVIDIA has provided a technical deep dive into the creation of its Nemotron 3 Ultra NVFP4 checkpoint, detailing the specific optimization techniques used to shrink the 550B-parameter model without sacrificing accuracy. The process, executed with the NVIDIA Model Optimizer, offers a practical guide for developers aiming to use the new 4-bit floating-point format native to the Blackwell architecture. The resulting quantized model is roughly 3.2x smaller, shrinking from 1,121 GB to 352.3 GB, and delivers up to 5.9x higher inference throughput on certain workloads compared to similarly quantized large models.

The key to maintaining accuracy was a mixed-precision strategy, where different model layers were quantized to different formats based on their sensitivity. While critical components like MoE routed experts were compressed to NVFP4, other sensitive layers remained in BF16 or FP8. Finding the optimal quantization recipe involved moving beyond simple scaling methods. Researchers experimented with several approaches to determine how to map model weights to NVFP4's limited set of representable values.

Finding the Optimal Quantization Recipe

The team explored several calibration strategies to minimize information loss during quantization:

Max (absmax) Scaling: A simple method where the scale is set by the single largest value in a block of weights. A single outlier stretches the scale and leaves less resolution for the remaining weights.
Mean Squared Error (MSE) Scaling: Searches for a scale that minimizes the average reconstruction error across a block. Lower error, however, doesn't always correlate with better downstream task accuracy.
Four-over-six Scaling: The method ultimately used for Nemotron 3 Ultra's routed experts. It dynamically chooses to scale weight blocks to a maximum of either 4 or 6, minimizing error caused by a large representational gap in the NVFP4 format.
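
The difference between max scaling and four-over-six scaling can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not NVIDIA's implementation: the helper names `fake_quantize` and `four_over_six` are hypothetical, the block scale is kept in full precision (NVFP4 stores it as an FP8 value), and the reconstruction error is taken to be the sum of squared differences.

```python
import numpy as np

# Non-negative magnitudes representable by a 4-bit E2M1 float.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize(block, target_max):
    """Scale the block so its absmax maps to target_max, round to the grid, scale back."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return block.copy()
    scale = amax / target_max  # assumption: full-precision scale, not FP8
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx] * scale

def four_over_six(block):
    """Try target maxima 6 and 4 and keep whichever has the lower squared error."""
    candidates = {t: fake_quantize(block, t) for t in (6.0, 4.0)}
    errors = {t: float(np.sum((block - q) ** 2)) for t, q in candidates.items()}
    best = min(errors, key=errors.get)
    return candidates[best], best, errors

# A 16-element block whose values cluster just below the maximum.
clustered = np.array([4.9] * 15 + [6.0])
_, chosen, errs = four_over_six(clustered)
print(chosen, errs)
```

Scaling to 6 is what plain max scaling does on this grid. In the `clustered` block, that choice pushes most weights into the gap between 4 and 6, so scaling to 4 reconstructs them more closely. For a block that already sits on the grid, such as `[1.0, 2.0, 3.0, 6.0]`, scaling to 6 remains the better choice.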

This methodical approach, automated by tools like NVIDIA Model Optimizer's `auto_quantize` function, led to an optimal effective bits-per-element (BPE) of 5.03, balancing model compression against performance on key benchmarks. The resulting checkpoint is also cross-compatible: it runs natively as W4A4 (4-bit weights, 4-bit activations) on Blackwell and is automatically converted to W4A16 (4-bit weights, 16-bit activations) on older Hopper GPUs.
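
One way to read an effective BPE figure is as a parameter-weighted average of the storage cost of each format in the mixed-precision recipe. The sketch below assumes that reading. The article does not say how the 5.03 figure was computed or whether it counts scale overhead, and the parameter split used here is hypothetical rather than the actual Nemotron 3 Ultra recipe.

```python
# Storage cost per element. NVFP4 stores 4-bit values plus one 8-bit scale per
# 16-element block, i.e. 4 + 8/16 = 4.5 bits (the per-tensor scale is ignored here).
BITS_PER_ELEMENT = {"nvfp4": 4.5, "fp8": 8.0, "bf16": 16.0}

def effective_bpe(param_counts):
    """Parameter-weighted average bits per element across formats."""
    total = sum(param_counts.values())
    weighted = sum(BITS_PER_ELEMENT[fmt] * n for fmt, n in param_counts.items())
    return weighted / total

# Hypothetical split (in arbitrary units), for illustration only.
example_split = {"nvfp4": 90, "fp8": 5, "bf16": 5}
print(effective_bpe(example_split))  # (90*4.5 + 5*8 + 5*16) / 100 = 5.25
```

Under this reading, even a small share of layers kept in BF16 or FP8 pulls the average noticeably above the 4.5 bits of a pure NVFP4 model.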

By publishing its methodology, NVIDIA is doing more than releasing a model; it's providing a strategic blueprint for the entire ecosystem. This move demystifies advanced quantization, turning it from a specialized research problem into an accessible engineering practice. It also helps developers extract maximum performance from both new and existing hardware, further cementing NVIDIA's integration of software and hardware.
