AiPhreaks

How the NVIDIA Vera Rubin Platform is Solving Agentic AI’s Scale-Up Problem

By Jakub Antkiewicz

May 15, 2026

NVIDIA has detailed a new architecture designed to address the unique performance demands of agentic AI, where complex tasks generate hundreds of unpredictable inference requests per session. The NVIDIA Vera Rubin Platform combines the company's powerful Vera Rubin NVL72 compute engine with a new, specialized accelerator, the NVIDIA Groq 3 LPX. This pairing is engineered to provide the sustained low latency and high throughput required for trillion-parameter Mixture-of-Experts (MoE) models, a combination that has been difficult to serve economically with existing hardware.

The platform's design targets the core bottleneck in agentic workloads: network variability across the many chips needed to run large models. Conventional networking is ill-suited to the small batches and long context windows common in multi-agent systems. The Groq 3 LPX addresses this with a deterministic chip-to-chip (C2C) interconnect that uses compiler-scheduled data movement and plesiochronous timing to make thousands of accelerators operate as a single, low-jitter system. This approach avoids the runtime contention that degrades performance in traditional fabrics, ensuring predictable latency as the system scales.
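The idea behind compiler-scheduled data movement can be illustrated with a toy model: every chip-to-chip transfer is assigned a fixed time slot on its link at compile time, so no two transfers ever contend at runtime. The class and function names below are illustrative assumptions for this sketch, not NVIDIA or Groq APIs.

```python
# Toy sketch of compiler-scheduled chip-to-chip transfers (hypothetical
# model; not the actual Groq 3 LPX scheduler or any NVIDIA API).
from dataclasses import dataclass

@dataclass(frozen=True)
class Transfer:
    src: int       # source accelerator id
    dst: int       # destination accelerator id
    link: int      # physical link the transfer uses
    duration: int  # duration in fixed clock ticks

def schedule(transfers):
    """Assign each transfer a start tick ahead of time so that no two
    transfers ever overlap on a link. Because arrival times are fixed
    before execution, there is no runtime arbitration and no jitter."""
    link_free_at = {}  # link id -> next free tick on that link
    plan = []
    for t in transfers:
        start = link_free_at.get(t.link, 0)
        plan.append((t, start))
        link_free_at[t.link] = start + t.duration
    return plan

plan = schedule([
    Transfer(src=0, dst=1, link=0, duration=4),
    Transfer(src=2, dst=3, link=0, duration=4),  # same link: serialized at compile time
    Transfer(src=4, dst=5, link=1, duration=4),  # free link: starts immediately
])
for t, start in plan:
    print(f"chip {t.src} -> chip {t.dst} on link {t.link}: ticks {start}-{start + t.duration}")
```

The key property is that the worst-case latency of every transfer is known before the program runs, which is what lets thousands of accelerators behave as one predictable system.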

Solving Inference with a Heterogeneous Engine

This architecture introduces a heterogeneous compute model orchestrated by a software layer called NVIDIA Dynamo. Using a technique named Attention-FFN Disaggregation (AFD), tasks are split between the two processor types: the Vera Rubin NVL72 GPUs handle the throughput-dominated prefill and attention calculations, while the Groq 3 LPX LPUs accelerate the latency-sensitive FFN decode loop. This division of labor lets each component excel at its intended task, delivering what NVIDIA claims is up to 35x higher throughput per megawatt than the previous-generation NVIDIA GB200 NVL72 and unlocking significantly more revenue potential for premium agentic AI services.
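The AFD split described above can be sketched as a simple phase-to-pool routing rule. The pool names and phase labels here are illustrative assumptions; the article does not describe NVIDIA Dynamo's actual scheduling interface.

```python
# Minimal sketch of Attention-FFN Disaggregation (AFD) routing
# (hypothetical names; not the NVIDIA Dynamo API).

GPU_POOL = "vera_rubin_nvl72"  # throughput-oriented: prefill + attention
LPU_POOL = "groq3_lpx"         # latency-oriented: FFN decode loop

def route(phase: str) -> str:
    """Map an inference phase to the accelerator pool suited to it."""
    if phase in ("prefill", "attention"):
        return GPU_POOL  # large, batch-friendly matrix multiplies
    if phase == "ffn_decode":
        return LPU_POOL  # small-batch, latency-critical expert FFNs
    raise ValueError(f"unknown phase: {phase}")

# The phases of generating one token, each dispatched to its pool.
steps = [route(p) for p in ("prefill", "attention", "ffn_decode")]
print(steps)  # ['vera_rubin_nvl72', 'vera_rubin_nvl72', 'groq3_lpx']
```

In a real system the activations would cross the C2C interconnect between pools on every decode step, which is why the deterministic fabric matters: the handoff latency between attention and FFN stages must stay predictable.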

  • Performance Target: 400 tokens per second per user on trillion-parameter MoE models with 400K-token context.
  • NVIDIA Groq 3 LPX Rack: Provides up to 640 TB/s of scale-up bandwidth and a 128 GB unified SRAM pool across 256 LPUs.
  • NVIDIA Vera Rubin NVL72 Rack: Delivers up to 3,600 PFLOPS of NVFP4 compute, 20.7 TB of HBM4, and 1.6 PB/s of memory bandwidth.
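As a back-of-the-envelope check, dividing the rack-level figures above evenly across devices gives the per-accelerator budget. The even split is an assumption for illustration; the article does not break the numbers down per chip.

```python
# Per-device figures derived by evenly dividing the quoted rack specs
# (assumption: resources are spread uniformly across accelerators).

# NVIDIA Groq 3 LPX rack: 256 LPUs
lpus = 256
scale_up_bw_tb_s = 640   # TB/s of scale-up bandwidth
sram_gb = 128            # GB of unified SRAM
print(f"per LPU: {scale_up_bw_tb_s / lpus} TB/s, {sram_gb / lpus} GB SRAM")
# per LPU: 2.5 TB/s, 0.5 GB SRAM

# NVIDIA Vera Rubin NVL72 rack: 72 GPUs
gpus = 72
nvfp4_pflops = 3600      # PFLOPS of NVFP4 compute
hbm4_tb = 20.7           # TB of HBM4
mem_bw_pb_s = 1.6        # PB/s of memory bandwidth
print(f"per GPU: {nvfp4_pflops / gpus} PFLOPS, "
      f"{hbm4_tb / gpus * 1000:.1f} GB HBM4, "
      f"{mem_bw_pb_s / gpus * 1000:.1f} TB/s")
# per GPU: 50.0 PFLOPS, 287.5 GB HBM4, 22.2 TB/s
```

The small per-LPU SRAM pool (0.5 GB under this even split) underlines why the weights of a trillion-parameter model must be streamed across the deterministic fabric rather than held on any single chip.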
NVIDIA's integration of Groq's specialized LPU architecture into its flagship Vera Rubin platform signals a strategic shift toward heterogeneous computing. It is an acknowledgment that dominating the next wave of low-latency agentic AI means moving beyond a pure-GPU approach and embracing purpose-built silicon for specific stages of the inference pipeline.