Boosting MoE Training Throughput with Advanced Fusion Kernels
By Jakub Antkiewicz
2026-06-16T12:52:45Z
NVIDIA Targets MoE Training Efficiency with Fused Kernels
NVIDIA has released a set of fused MLP kernels, written in its CuTe DSL, to accelerate the training of Mixture-of-Experts (MoE) models, an architecture behind many of today's large-scale AI systems. The gains come from removing system-level bottlenecks rather than from faster math alone. In pre-training, the kernels have demonstrated up to an 8% end-to-end throughput improvement for DeepSeek-V3 and a 93% improvement for GPT-OSS, which translates directly into lower compute cost and shorter training time.
Addressing Core MoE Bottlenecks
The new kernels target three main inefficiencies in MoE training: memory-bound activation functions, CPU synchronization overhead, and the cost of running quantization as a separate pass. Fusing multiple operations into a single GPU kernel means intermediate data no longer has to be written to and read back from global memory, which keeps the Tensor Cores busy. This hardware-aware codesign applies to both the forward and backward passes. The features below address these bottlenecks, with dynamic scheduling added to help the kernels overlap with communication:
- Fused GLU Activations: Supports advanced functions like SwiGLU and GeGLU by repacking weights to compute outputs within the GEMM epilogue, avoiding memory round-trips.
- Sync-Free Execution: Manages token routing directly on the GPU, which removes CPU dependency and allows for the use of full-iteration NVIDIA CUDA Graphs.
- Integrated Quantization: Fuses low-precision quantization steps for formats like MXFP8 and NVFP4 directly into the kernel, removing a separate, memory-intensive pass.
- Dynamic Scheduling: Allows for efficient overlap with other operations, such as communication kernels used in expert and data parallelism.
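To make the fused GLU idea concrete, here is a minimal pure-Python sketch rather than NVIDIA's kernel code. It assumes one possible repacking scheme, interleaving the gate and up-projection weight columns so that each pair of adjacent GEMM outputs can be combined immediately. The actual layout and tile granularity used by the CuTe DSL kernels are not described in this article, and all function names here are hypothetical.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matmul(a, b):
    """Plain row-by-column matrix product on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def swiglu_unfused(x, w_gate, w_up):
    """Baseline: two GEMMs, then a separate elementwise pass.

    On a GPU, `gate` and `up` would be written to global memory and
    read back by the activation kernel.
    """
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]


def repack_interleaved(w_gate, w_up):
    """Assumed repacking: interleave columns as [g0, u0, g1, u1, ...]."""
    return [[v for pair in zip(g_row, u_row) for v in pair]
            for g_row, u_row in zip(w_gate, w_up)]


def swiglu_fused(x, w_packed):
    """Single pass over the packed weights.

    Each (gate, up) accumulator pair is combined as soon as it is
    produced, standing in for a GEMM epilogue. No full-size
    intermediate matrix is ever stored.
    """
    cols = list(zip(*w_packed))
    out = []
    for row in x:
        out_row = []
        for j in range(0, len(cols), 2):
            g = sum(a * b for a, b in zip(row, cols[j]))
            u = sum(a * b for a, b in zip(row, cols[j + 1]))
            out_row.append(silu(g) * u)  # epilogue step
        out.append(out_row)
    return out


if __name__ == "__main__":
    x = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.5]]
    w_gate = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.7]]
    w_up = [[1.0, 0.5], [-0.3, 0.8], [0.6, -0.2]]
    print(swiglu_unfused(x, w_gate, w_up))
    print(swiglu_fused(x, repack_interleaved(w_gate, w_up)))
```

Both functions produce the same values. The difference is that the fused version never holds the full `gate` and `up` matrices, and that is where the memory round-trip is saved on real hardware.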
Ecosystem Integration and Availability
The kernels are available at several levels of the NVIDIA software stack. Developers can call them directly from the cuDNN Frontend library, use them through the NVIDIA Transformer Engine, or enable them as a configurable option in Megatron-Core, so teams can adopt the optimizations at whichever layer they already work in. With kernel-level speedups of 1.3x to 2.1x, the effect on hardware utilization for platforms like the GB200 is considerable and should help the next generation of foundation models scale more efficiently.
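Kernel-level speedups and end-to-end gains are different quantities, because only part of each training step runs inside the fused kernels. The sketch below applies Amdahl's law to show the relationship. The time fractions in the usage loop are hypothetical, not figures from NVIDIA, and the model ignores gains that come from other sources, such as CUDA Graphs or better overlap with communication.

```python
def end_to_end_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only part of the step gets faster.

    kernel_fraction: share of baseline step time spent in the accelerated
                     kernels (0.0 to 1.0); a hypothetical input here.
    kernel_speedup:  speedup of those kernels alone (e.g. 1.3 or 2.1).
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = (1.0 - kernel_fraction) + kernel_fraction / kernel_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Illustrative fractions only; real values depend on model and system.
    for fraction in (0.2, 0.4):
        for speedup in (1.3, 2.1):
            total = end_to_end_speedup(fraction, speedup)
            print(f"fraction={fraction:.1f} kernel={speedup:.1f}x "
                  f"-> end-to-end={total:.3f}x")
```

Under this simple model, the end-to-end gain is always smaller than the kernel-level speedup unless the accelerated kernels account for the entire step.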
By delivering these MoE optimizations through its tightly integrated software stack, from cuDNN to Megatron-Core, NVIDIA is reinforcing the competitive moat around its hardware and making the full-stack ecosystem essential for achieving peak training performance.