Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform
By Jakub Antkiewicz
March 17, 2026
NVIDIA has introduced the Groq 3 LPX, a rack-scale accelerator designed specifically for low-latency inference within its next-generation Vera Rubin platform. The system is engineered to address the performance demands of emerging agentic AI, where near-instantaneous response times are critical for real-time interaction and complex reasoning. By integrating the LPX alongside the general-purpose Vera Rubin NVL72, NVIDIA is creating a heterogeneous computing environment, acknowledging that the requirements for training and high-throughput batch processing are diverging from the needs of interactive, user-facing applications.
The Groq 3 LPX system is built around 256 interconnected NVIDIA Groq 3 LPU accelerators, which eschew traditional hardware-managed caches in favor of 128 GB of aggregate on-chip SRAM. This design choice, combined with a compiler that explicitly orchestrates data movement and computation, aims to deliver deterministic execution with minimal jitter. In this arrangement, the LPX handles the latency-sensitive portions of inference, such as the token decode loop, while the more powerful Vera Rubin NVL72 GPUs manage computationally intensive tasks like prefill and attention over long contexts. This division of labor is intended to optimize the entire AI serving pipeline for both speed and efficiency.
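To make the division of labor concrete, the split can be sketched as a disaggregated serving loop: a throughput-oriented pool runs the compute-bound prefill pass and hands off the resulting KV cache, after which a latency-oriented pool runs the sequential decode loop. This is a minimal illustrative sketch only; the class and function names (`PrefillPool`, `DecodePool`, `serve`) are hypothetical and the "model" is a toy stand-in, not any NVIDIA API.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    """Toy stand-in for the key/value cache handed off between pools."""
    tokens: list

class PrefillPool:
    """Stands in for the throughput-oriented NVL72 GPUs: processes the
    whole prompt in one batch-friendly pass and builds the KV cache."""
    def prefill(self, prompt_tokens):
        return KVCache(tokens=list(prompt_tokens))

class DecodePool:
    """Stands in for the latency-oriented LPX: emits one token per step
    against the handed-off cache, where per-step latency dominates."""
    def decode_step(self, cache):
        next_token = len(cache.tokens)  # toy "model": token = position
        cache.tokens.append(next_token)
        return next_token

def serve(prompt_tokens, max_new_tokens, prefill_pool, decode_pool):
    # Compute-bound phase on the throughput pool.
    cache = prefill_pool.prefill(prompt_tokens)
    # Latency-bound sequential loop on the low-latency pool.
    return [decode_pool.decode_step(cache) for _ in range(max_new_tokens)]

print(serve([10, 11, 12], 4, PrefillPool(), DecodePool()))
```

The point of the structure is that the two phases have different hardware sweet spots: prefill parallelizes across the prompt, while decode is a serial loop whose per-token latency the SRAM-resident LPX design targets.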
This announcement signals a significant maturation in the AI hardware market, moving beyond a one-size-fits-all approach to acceleration. By offering a specialized component for low-latency serving, NVIDIA is enabling data center operators to build more efficient and cost-effective infrastructure tailored to a mixed workload of background AI tasks and interactive agentic systems. This architectural split directly addresses the bottlenecks in today's serving models, particularly as AI-powered applications increasingly rely on long chains of thought and continuous, high-speed token generation to deliver more sophisticated user experiences.
NVIDIA's introduction of the specialized Groq 3 LPX alongside its general-purpose Vera Rubin platform marks a strategic commitment to heterogeneous data center architectures, recognizing that a single accelerator design cannot efficiently serve the divergent demands of high-throughput training and ultra-low-latency agentic inference.