AiPhreaks

Making Softmax More Efficient with NVIDIA Blackwell Ultra

By Jakub Antkiewicz

2026-02-26T08:46:08Z

NVIDIA's new Blackwell Ultra architecture is engineered to address a growing performance bottleneck in large language models: the softmax function. As AI models adopt longer context lengths and more complex attention mechanisms, the transcendental math executed by Special Function Units (SFUs) for softmax calculations is increasingly limiting inference speed. Blackwell Ultra directly targets this issue by doubling SFU throughput for the natural exponential function, a core component of softmax, aiming to reduce pipeline stalls that leave powerful Tensor Cores idle.
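GPU SFUs evaluate a fast base-2 exponential, so the natural exponential inside softmax is computed as e^x = 2^(x · log2 e), which is exactly what the MUFU.EX2 instruction accelerates. A minimal NumPy sketch of the numerically stable softmax written in that base-2 form (the function name is illustrative, not an NVIDIA kernel):

```python
import numpy as np

LOG2_E = 1.4426950408889634  # log2(e): rewrites e^x as the 2^x form SFUs evaluate

def softmax_via_exp2(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax, expressed with 2**x to mirror MUFU.EX2.

    Since e^x == 2**(x * log2(e)), hardware only needs a fast base-2
    exponential to serve softmax.
    """
    shifted = scores - scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    powers = np.exp2(shifted * LOG2_E)                     # the step MUFU.EX2 accelerates
    return powers / powers.sum(axis=-1, keepdims=True)
```

The row-max subtraction changes nothing mathematically but keeps the exponentials in range, which is the standard formulation attention kernels use.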

The attention mechanism in transformers involves a sequential pipeline where matrix multiplications calculate scores, softmax normalizes them, and a final matrix multiplication aggregates context. On standard Blackwell GPUs (GB200), the Tensor Cores sit idle while the SFUs work through the softmax step, stalling the pipeline. By doubling the throughput of the MUFU.EX2 instruction, Blackwell Ultra systems like the GB300 can process this normalization step nearly twice as fast. Synthetic micro-benchmarks confirm the hardware improvement, and it translates to a measured ~35% increase in FP8 forward-propagation throughput for models such as DeepSeek-V3.
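The three-stage pipeline described above can be sketched in a few lines of NumPy. This is a toy single-head example of standard scaled-dot-product attention, not a reconstruction of any NVIDIA kernel; the comments mark which stage runs on which unit class:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Toy single-head attention: matmul -> softmax -> matmul.

    On a GPU, the two matmuls map to Tensor Cores, while the
    exponentials in the middle softmax step map to the SFUs
    (the MUFU.EX2 path that Blackwell Ultra doubles in throughput).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # Tensor Core work: similarity scores
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize before exponentiation
    weights = np.exp(scores)                       # SFU work: the softmax exponentials
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # Tensor Core work: aggregate context
```

Because the third stage consumes the softmax output, the matrix engines cannot start it until the exponentials finish, which is why speeding up the middle step raises end-to-end utilization.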

This architectural enhancement reflects a significant shift in AI accelerator design, moving beyond a singular focus on matrix math to a more holistic approach that optimizes the entire operational pipeline. By alleviating the softmax constraint, Blackwell Ultra allows for higher utilization of the GPU's matrix engines, leading to greater overall inference efficiency. This development underscores the increasing importance of hardware-software co-design, where targeted hardware accelerations are becoming critical for unlocking performance in next-generation AI systems that rely heavily on both linear and non-linear operations.

NVIDIA's focus on accelerating specialized non-linear functions in Blackwell Ultra demonstrates that the frontier of AI hardware performance is shifting from raw matrix compute power to the targeted elimination of specific, function-level bottlenecks within complex model architectures.