How does DFlash fundamentally differ from other speculative decoding methods like EAGLE-3?

While methods like EAGLE-3 use an autoregressive draft model that still generates candidate tokens one by one, DFlash uses a block-diffusion drafter. This key difference allows DFlash to propose an entire block of future tokens in a single parallel forward pass, making it more efficient on hardware like NVIDIA's Blackwell GPUs, which excel at parallel computation.
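To make the contrast concrete, here is a minimal sketch that counts drafter forward passes for the two drafting styles. The class CountingDrafter and its methods are hypothetical stand-ins invented for illustration; they are not part of DFlash, EAGLE-3, or any inference framework, and the "model" is just simple arithmetic.

```python
from typing import List


class CountingDrafter:
    """Hypothetical toy drafter that records how many forward passes it runs."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def next_token(self, context: List[int]) -> int:
        # One forward pass yields exactly one token (autoregressive style).
        self.forward_passes += 1
        return (context[-1] + 1) % 100

    def next_block(self, context: List[int], block_size: int) -> List[int]:
        # One forward pass yields a whole block of tokens (block style).
        self.forward_passes += 1
        return [(context[-1] + 1 + i) % 100 for i in range(block_size)]


def draft_autoregressive(drafter: CountingDrafter, context: List[int], k: int) -> List[int]:
    """Draft k tokens one by one; each step depends on the previous output."""
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(drafter.next_token(context + drafted))
    return drafted


def draft_in_block(drafter: CountingDrafter, context: List[int], k: int) -> List[int]:
    """Draft k tokens with a single call."""
    return drafter.next_block(context, k)


ar_drafter, block_drafter = CountingDrafter(), CountingDrafter()
ar_tokens = draft_autoregressive(ar_drafter, [5], 4)
block_tokens = draft_in_block(block_drafter, [5], 4)
print(ar_tokens, ar_drafter.forward_passes)
print(block_tokens, block_drafter.forward_passes)
```

In this toy both drafters propose the same four tokens, but the autoregressive one needs four dependent passes while the block one needs a single pass. A real block-diffusion drafter is far more involved; only the pass count is being illustrated.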

Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

A new open-source speculative decoding model called DFlash is delivering significant inference performance gains on NVIDIA's Blackwell architecture. By drafting entire blocks of tokens in parallel, the technique has demonstrated up to a 15x throughput improvement for large models like gpt-oss-120b. This development directly addresses the growing demand for low-latency inference required by increasingly complex and interactive multi-agent AI workflows.

Technical Breakdown: From Sequential to Parallel Drafting

Unlike traditional speculative decoding methods that generate candidate tokens sequentially, DFlash employs a lightweight block-diffusion drafter. The drafter proposes multiple future tokens in a single forward pass, which the larger target model then verifies in parallel. This approach shifts the workload from a memory-bound, sequential process to a more compute-intensive parallel task, making better use of Blackwell GPUs. Key performance metrics highlight its efficiency:

Up to 15x higher throughput for gpt-oss-120b on an NVIDIA DGX B300 system compared to autoregressive decoding.
Nearly double the interactivity for Llama 3.1 8B compared to the EAGLE-3 speculative decoding model.
Throughput speedups of up to 5.8x for Gemma 4 31B on vLLM and 5.1x for Qwen3 8B on SGLang.
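The draft-then-verify loop described above can be sketched with toy stand-ins. This is an illustrative sketch under stated assumptions, not DFlash's implementation: target_next, draft_block, and verify_block are hypothetical functions, the "models" are simple arithmetic, greedy decoding is assumed, and the drafter is deliberately made wrong at some positions so that rejection is exercised. In a real system the drafter produces its block in one forward pass and the target scores all drafted positions in one batched pass; the Python loops here only simulate that.

```python
from typing import List, Tuple


def target_next(context: List[int]) -> int:
    """Hypothetical target model: greedy next token as a function of the context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_block(context: List[int], block_size: int) -> List[int]:
    """Hypothetical block drafter: returns block_size guesses in one call.

    The guesses mostly match the target, but a wrong token is injected at
    every fourth absolute position to simulate imperfect drafting.
    """
    proposal: List[int] = []
    scratch = list(context)
    for _ in range(block_size):
        tok = target_next(scratch)
        if len(scratch) % 4 == 3:
            tok = (tok + 1) % 50  # deliberately wrong guess
        proposal.append(tok)
        scratch.append(tok)
    return proposal


def verify_block(context: List[int], proposal: List[int]) -> List[int]:
    """Accept the longest correct prefix, then add one token from the target."""
    accepted: List[int] = []
    scratch = list(context)
    for tok in proposal:
        expected = target_next(scratch)
        if tok != expected:
            accepted.append(expected)  # correction replaces the first wrong guess
            return accepted
        accepted.append(tok)
        scratch.append(tok)
    accepted.append(target_next(scratch))  # bonus token when every guess matched
    return accepted


def speculative_generate(prompt: List[int], num_tokens: int, block_size: int) -> Tuple[List[int], int]:
    """Generate num_tokens tokens; also return the number of verification passes."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        proposal = draft_block(out, block_size)
        out.extend(verify_block(out, proposal))
        target_passes += 1
    return out[: len(prompt) + num_tokens], target_passes


def autoregressive_generate(prompt: List[int], num_tokens: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


prompt = [1, 2, 3]
spec_out, passes = speculative_generate(prompt, 20, 4)
baseline_out = autoregressive_generate(prompt, 20)
print(spec_out == baseline_out, passes)
```

Because every accepted token equals what the target would have produced greedily, the speculative output matches the baseline exactly while using fewer target passes than tokens generated. How much is saved in practice depends on how often the drafter's guesses are accepted, which this toy fixes artificially.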

Ecosystem Integration and Developer Impact

The rapid integration of DFlash into major inference frameworks like SGLang, vLLM, and TensorRT-LLM signals a key advantage for developers. With model checkpoints already available on Hugging Face for architectures like Blackwell and Hopper, teams can adopt this optimization without significant code refactoring. This quick path from academic research, which originated at UC San Diego, to production-ready tools underscores the NVIDIA ecosystem's ability to propagate performance enhancements and lower the barrier for deploying more responsive and efficient AI applications.

DFlash's swift adoption into core inference libraries shows the industry is moving beyond model-centric optimizations and is now focused on fundamentally re-architecting the token generation process. By converting the sequential decoding bottleneck into a parallel compute problem, this technique directly leverages the architectural strengths of new hardware like NVIDIA Blackwell, making low-latency agentic systems more operationally viable.

Source: NVIDIA