Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation
By Jakub Antkiewicz
•2026-06-11T12:11:46Z
Google and NVIDIA Target AI Latency with Parallel Text Generation
Google DeepMind, in collaboration with NVIDIA, has released an optimized version of its DiffusionGemma model designed to run efficiently across NVIDIA's hardware stack. The model addresses the inherent latency of sequential, token-by-token generation common in autoregressive models. By generating text tokens in parallel, DiffusionGemma aims to improve throughput and responsiveness for real-time AI applications such as interactive agents and copilots, where speed is a critical operational factor.
Technical Architecture and Performance
DiffusionGemma utilizes a diffusion-based denoising process to produce 256 tokens in a single step, a fundamental departure from the one-at-a-time method of most large language models. This allows for significant performance gains, with benchmarks indicating throughput of up to 1,000 tokens per second on a single NVIDIA H100 GPU. The model is built on the Gemma 4 26B Mixture-of-Experts (MoE) architecture and is optimized for memory-bound inference tasks.
- Architecture: Gemma 4 26B A4B MoE
- Active Parameters: 3.8B
- Context Length: Up to 256K tokens
- Supported Precision: BF16, NVFP4
- Supported Hardware: NVIDIA H100, DGX Spark, DGX Station, RTX / RTX PRO
Developer Ecosystem and Deployment
NVIDIA has integrated DiffusionGemma into its full software stack, providing a clear path from development to production. Developers can begin prototyping on Hugging Face, utilize vLLM for high-throughput serving, and fine-tune the model for specific tasks using the NVIDIA NeMo AutoModel library. For enterprise-grade deployments, the model is available as a containerized microservice through NVIDIA NIM, which includes standardized, OpenAI-compatible APIs to simplify integration into existing infrastructure.
Strategic Takeaway: NVIDIA's comprehensive support for DiffusionGemma is a strategic move to standardize the MLOps pipeline for emerging non-autoregressive models. By providing a unified path from local RTX development to cloud deployment with NIM, NVIDIA is positioning its ecosystem as the default infrastructure for productionizing alternative AI architectures, effectively reducing the friction for enterprises to adopt high-throughput, lower-latency text generation.