AiPhreaks

Making Softmax More Efficient with NVIDIA Blackwell Ultra

By Jakub Antkiewicz

2026-02-26T08:46:08Z

NVIDIA's new Blackwell Ultra architecture is engineered to address a growing performance bottleneck in large language models: the softmax function. As AI models adopt longer context lengths and more complex attention mechanisms, the transcendental math executed by Special Function Units (SFUs) for softmax calculations is increasingly limiting inference speed. Blackwell Ultra directly targets this issue by doubling SFU throughput for the natural exponential function, a core component of softmax, aiming to reduce pipeline stalls that leave powerful Tensor Cores idle.
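GPU SFUs evaluate a fast base-2 exponential, so the natural exponential inside softmax is computed as e^x = 2^(x · log2 e), which is exactly what the MUFU.EX2 instruction accelerates. A minimal NumPy sketch of the numerically stable softmax written in that base-2 form (the function name is illustrative, not an NVIDIA kernel):

```python
import numpy as np

LOG2_E = 1.4426950408889634  # log2(e): rewrites e^x as the 2^x form SFUs evaluate

def softmax_via_exp2(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax, expressed with 2**x to mirror MUFU.EX2.

    Since e^x == 2**(x * log2(e)), hardware only needs a fast base-2
    exponential to serve softmax.
    """
    shifted = scores - scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    powers = np.exp2(shifted * LOG2_E)                     # the step MUFU.EX2 accelerates
    return powers / powers.sum(axis=-1, keepdims=True)
```

The row-max subtraction changes nothing mathematically but keeps the exponentials in range, which is the standard formulation attention kernels use.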

The attention mechanism in transformers involves a sequential pipeline where matrix multiplications calculate scores, softmax normalizes them, and a final matrix multiplication aggregates context. On standard Blackwell GPUs (GB200), the Tensor Cores sit idle while the SFUs work through the softmax step, stalling the pipeline. By doubling the throughput of the MUFU.EX2 instruction, Blackwell Ultra systems like the GB300 can process this normalization step nearly twice as fast. Synthetic micro-benchmarks confirm the hardware improvement, and it translates to a measured ~35% increase in FP8 forward-propagation throughput for models such as DeepSeek-V3.
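The three-stage pipeline described above can be sketched in a few lines of NumPy. This is a toy single-head example of standard scaled-dot-product attention, not a reconstruction of any NVIDIA kernel; the comments mark which stage runs on which unit class:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Toy single-head attention: matmul -> softmax -> matmul.

    On a GPU, the two matmuls map to Tensor Cores, while the
    exponentials in the middle softmax step map to the SFUs
    (the MUFU.EX2 path that Blackwell Ultra doubles in throughput).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # Tensor Core work: similarity scores
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize before exponentiation
    weights = np.exp(scores)                       # SFU work: the softmax exponentials
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # Tensor Core work: aggregate context
```

Because the third stage consumes the softmax output, the matrix engines cannot start it until the exponentials finish, which is why speeding up the middle step raises end-to-end utilization.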

This architectural enhancement reflects a significant shift in AI accelerator design, moving beyond a singular focus on matrix math to a more holistic approach that optimizes the entire operational pipeline. By alleviating the softmax constraint, Blackwell Ultra allows for higher utilization of the GPU's matrix engines, leading to greater overall inference efficiency. This development underscores the increasing importance of hardware-software co-design, where targeted hardware accelerations are becoming critical for unlocking performance in next-generation AI systems that rely heavily on both linear and non-linear operations.

NVIDIA's focus on accelerating specialized non-linear functions in Blackwell Ultra demonstrates that the frontier of AI hardware performance is shifting from raw matrix compute power to the targeted elimination of specific, function-level bottlenecks within complex model architectures.