AiPhreaks

Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer

By Jakub Antkiewicz

2026-06-27T10:07:48Z

NVIDIA Details High-Performance NVFP4 Quantization for Nemotron 3 Ultra

NVIDIA has published a technical deep-dive into the creation of its Nemotron 3 Ultra NVFP4 checkpoint, detailing the optimization techniques used to shrink the 550B-parameter model without sacrificing accuracy. The process, carried out with the NVIDIA Model Optimizer, serves as a practical guide for developers who want to use the new 4-bit floating-point format native to the Blackwell architecture. The quantized model is roughly 3.2x smaller, shrinking from 1,121 GB to 352.3 GB, and delivers up to 5.9x higher inference throughput on certain workloads compared to similarly quantized large models.

The key to maintaining accuracy was a mixed-precision strategy, where different model layers were quantized to different formats based on their sensitivity. While critical components like MoE routed experts were compressed to NVFP4, other sensitive layers remained in BF16 or FP8. Finding the optimal quantization recipe involved moving beyond simple scaling methods. Researchers experimented with several approaches to determine how to map model weights to NVFP4's limited set of representable values.

Finding the Optimal Quantization Recipe

The team explored several calibration strategies to minimize information loss during quantization:

  • Max (absmax) Scaling: A simple method where the scale is set by the single largest value in a block of weights. A single outlier stretches the scale and leaves less resolution for the remaining weights in the block.
  • Mean Squared Error (MSE) Scaling: Searches for a scale that minimizes the average reconstruction error across a block, though lower error doesn't always translate into better downstream task accuracy.
  • Four-over-six Scaling: The method ultimately used for Nemotron 3 Ultra's routed experts. For each weight block, it chooses whether the block's largest value maps to 4 or to 6, which reduces the error caused by the large gap between those two representable values in the NVFP4 format.
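The three strategies above can be compared on a single block of weights. The following is a minimal sketch in plain Python, not NVIDIA Model Optimizer code: the function names are hypothetical, the scale is kept in full precision (real NVFP4 stores a quantized per-block scale), the MSE search range is an arbitrary choice, and using reconstruction error to pick between 4 and 6 is an assumption about how the selection is made.

```python
# Magnitudes representable by a 4-bit E2M1 element (sign is handled separately).
# Note the jump from 4.0 to 6.0 with nothing in between.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, scale):
    """Round each weight to the nearest FP4 magnitude at `scale`, then dequantize."""
    if scale == 0.0:
        return [0.0 for _ in block]
    out = []
    for w in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append((mag if w >= 0 else -mag) * scale)
    return out


def mse(block, scale):
    """Mean squared reconstruction error of the block at `scale`."""
    deq = quantize_block(block, scale)
    return sum((a - b) ** 2 for a, b in zip(block, deq)) / len(block)


def absmax_scale(block, target=6.0):
    """Scale that maps the block's largest magnitude onto `target`."""
    return max(abs(w) for w in block) / target


def mse_scale(block, candidates=64):
    """Grid search between 0.5x and 1.0x the absmax scale (assumed range)."""
    base = absmax_scale(block)
    scales = [base * (0.5 + 0.5 * i / (candidates - 1)) for i in range(candidates)]
    return min(scales, key=lambda s: mse(block, s))


def four_over_six_scale(block):
    """Try mapping the block max to 6 and to 4; keep the lower-error option."""
    options = (absmax_scale(block, 6.0), absmax_scale(block, 4.0))
    return min(options, key=lambda s: mse(block, s))
```

For the block `[1.0, 0.83, -0.8, 0.2]`, several weights sit near five sixths of the maximum, which falls in the gap between 4 and 6 when the maximum is mapped to 6. In this sketch `four_over_six_scale` therefore picks the scale that maps the maximum to 4 (a scale of 0.25).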

This methodical approach, automated by tools like NVIDIA Model Optimizer's `auto_quantize` function, led to an optimal effective bits-per-element (BPE) of 5.03, balancing model compression with performance on key benchmarks. The resulting checkpoint is also cross-compatible, running with native W4A4 on Blackwell and automatically converting to W4A16 on older Hopper GPUs.
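One plausible reading of effective bits-per-element is a parameter-weighted average of the storage cost of each format. The sketch below illustrates that arithmetic only. It assumes NVFP4 costs 4.5 bits per element (4-bit values plus one 8-bit scale per 16-element block), ignores per-tensor scales, and uses a made-up parameter split rather than the actual Nemotron 3 Ultra layer assignment; `effective_bpe` is a hypothetical helper, not a Model Optimizer function.

```python
# Approximate storage cost per element for each format, in bits.
BITS_PER_ELEMENT = {
    "NVFP4": 4 + 8 / 16,  # 4-bit element + one 8-bit scale shared by 16 elements
    "FP8": 8.0,
    "BF16": 16.0,
}


def effective_bpe(param_counts):
    """Parameter-weighted average bits per element for a mixed-precision model.

    `param_counts` maps a format name to the number of parameters stored in it.
    """
    total = sum(param_counts.values())
    weighted = sum(BITS_PER_ELEMENT[fmt] * n for fmt, n in param_counts.items())
    return weighted / total


# Illustrative split only (not the real model): 90% NVFP4, 10% FP8.
example = effective_bpe({"NVFP4": 90, "FP8": 10})
```

Under these assumptions, keeping more layers in FP8 or BF16 raises the average above the 4.5-bit floor, which is the trade-off the reported 5.03 figure reflects.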

By publishing its methodology, NVIDIA is doing more than releasing a model; it's providing a blueprint for the wider ecosystem. The write-up demystifies advanced quantization, turning it from a specialized research problem into an accessible engineering practice, and helps developers extract maximum performance from both new and existing hardware. That, in turn, reinforces NVIDIA's integration of software and hardware.