What is the core technology that enables MiniMax M3's efficiency at a 1M-token context length?

MiniMax M3's efficiency comes from its core architectural innovation, MiniMax Sparse Attention (MSA). Unlike standard attention, whose cost grows quadratically with sequence length, MSA uses a pre-filtering stage to identify only the most relevant context blocks and then computes attention exclusively on those blocks. Because the selected blocks are contiguous in memory, access is reported to be over 4x faster, and per-token compute cost at 1M context drops by a factor of 20. MSA does not compress key-values, so this efficiency does not come at the expense of precision.

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

MiniMax M3 Unifies Multimodal Workflows on NVIDIA Infrastructure

MiniMax has announced that its MiniMax M3 model is now available on NVIDIA accelerated infrastructure, offering a unified system for long-context reasoning and agentic workflows. The integration targets the operational overhead developers face when they manage separate models for text, vision, and code. M3 is a single 428B-parameter Mixture-of-Experts (MoE) model that handles up to 1 million tokens and accepts native multimodal inputs, which is intended to reduce pipeline complexity and shorten development cycles for applications ranging from long-form video analysis to extended coding sessions.

Technical Architecture and Performance

The model's performance relies on an architectural innovation called MiniMax Sparse Attention (MSA), which replaces standard quadratic attention. MSA uses a pre-filtering stage to identify and attend only to relevant context blocks, enabling significantly more efficient memory access. According to the company, this results in a 20x reduction in per-token compute cost at the 1M context length, along with 9x faster prefill and 15x faster decoding speeds. This efficiency is achieved without compressing key-values, thus maintaining full precision.
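The two-stage idea described above (pre-filter blocks, then attend only within the selected blocks) can be illustrated with a minimal sketch. This is not MiniMax's implementation: the block-scoring rule (dot product of the query with each block's mean key), the function names, and the `block_size` and `top_k` parameters are all assumptions chosen for illustration, and a single query vector stands in for a full attention head.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_blocks(query, keys, block_size, top_k):
    """Pre-filter stage (assumed scoring rule): rank contiguous key blocks
    by the dot product of the query with the block's mean key."""
    n_blocks = math.ceil(len(keys) / block_size)
    scored = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), b))
    best = sorted(scored, reverse=True)[:top_k]
    return sorted(b for _, b in best)  # ascending block indices

def sparse_attention(query, keys, values, block_size, top_k):
    """Attention stage: softmax attention restricted to the selected blocks.
    Keys and values are used uncompressed."""
    blocks = select_blocks(query, keys, block_size, top_k)
    idx = [i for b in blocks
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in idx])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]
    return out, blocks

# Toy example: 8 keys in 4 blocks of 2; only block 1 aligns with the query.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
values = [[float(i)] for i in range(8)]
out, blocks = sparse_attention(query, keys, values, block_size=2, top_k=1)
print(blocks)  # [1]
print(out)     # [2.5]
```

In this sketch the attention stage touches only `top_k * block_size` keys per query instead of all of them, which is where the per-token saving would come from at long context. The pre-filter here still scans every key to build the block means; a practical system would presumably cache per-block summaries.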

Total Parameters: 428B (MoE)
Active Parameters: 22B per token
Context Length: 1,000,000 tokens
Input Modalities: Native support for video, image, and text
Attention Mechanism: MiniMax Sparse Attention (MSA)
Expert Configuration: 128 total experts, 4 activated per token
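As a quick check on what these figures imply, the arithmetic below computes the share of parameters and of experts that are active for each token. Only the four constants come from the specifications above; the variable names are illustrative.

```python
# Figures from the specification list above.
TOTAL_PARAMS_B = 428    # total parameters, in billions
ACTIVE_PARAMS_B = 22    # parameters active per token, in billions
TOTAL_EXPERTS = 128
ACTIVE_EXPERTS = 4

active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_expert_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS

print(f"Active parameters per token: {active_param_fraction:.3%}")  # 5.140%
print(f"Active experts per token:    {active_expert_fraction:.3%}")  # 3.125%
```

The parameter share is larger than the expert share. That is what one would expect if some weights, such as the attention layers and embeddings, are used for every token regardless of routing, although the source does not break the 22B figure down.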

For developers, the release is supported by a suite of NVIDIA tools. Open-source inference is available through libraries such as NVIDIA TensorRT LLM, SGLang, and vLLM. For large-scale production environments, NVIDIA Dynamo offers a disaggregated inference serving platform. Customization and reinforcement learning are supported through the NVIDIA NeMo Framework, which includes context parallelism for sequence lengths up to 128k tokens. Together, these tools provide a path from experimentation to scaled deployment.

The collaboration between MiniMax and NVIDIA points to a market in which foundation-model performance depends increasingly on an optimized, full-stack hardware and software ecosystem, shifting the competitive bottleneck from model architecture alone to efficient, large-scale deployment.
