Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai
By Jakub Antkiewicz
2026-02-23
NVIDIA, in partnership with AI cloud provider Nebius, has released benchmark data demonstrating that its Run:ai software can substantially improve the efficiency of large language model (LLM) inference. The results show that by using dynamic GPU fractioning, a technique that divides a single GPU's resources, enterprises can support up to 86% of the concurrent users of a full GPU while allocating only half of the hardware. This directly addresses the persistent and costly problem of underutilized GPU capacity in production AI systems, where models often occupy an entire accelerator despite sporadic traffic.
The joint tests were conducted on both an on-premises cluster with NVIDIA H100 NVL GPUs and a Nebius AI Cloud cluster using NVIDIA HGX B200 GPUs. For the Llama 3.1 8B model, a 0.5 GPU fraction delivered 77% of the full GPU's token throughput while maintaining a time-to-first-token (TTFT) under one second. The tests also revealed that smaller models like Phi-4-Mini benefit even more, with quarter-GPU fractions supporting up to 72% more concurrent users than a dedicated full-GPU allocation. The Run:ai intelligent scheduler manages these co-located workloads by enforcing memory isolation and distributing compute cycles based on priority and demand.
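In practice, a fraction is requested declaratively rather than by repartitioning hardware. A minimal sketch of what such a request can look like in a Kubernetes pod spec, using Run:ai's `gpu-fraction` annotation and scheduler; the pod name, namespace, and image are placeholders, and the annotation key and scheduler name should be verified against the documentation for the Run:ai version in use:

```yaml
# Illustrative only: asks the Run:ai scheduler for half of one GPU.
# Name and image are placeholders; verify annotation key and scheduler
# name against your installed Run:ai version.
apiVersion: v1
kind: Pod
metadata:
  name: llama-31-8b-inference          # placeholder pod name
  annotations:
    gpu-fraction: "0.5"                # request a 0.5 GPU fraction
spec:
  schedulerName: runai-scheduler       # delegate placement to Run:ai
  containers:
    - name: inference-server
      image: my-registry/llama-3.1-8b-server:latest   # placeholder image
```

From the workload's point of view nothing changes: the scheduler enforces the memory cap and time-shares compute between co-located pods, so two such pods can land on the same physical GPU.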
These findings signal a shift in how organizations can approach AI infrastructure management, moving GPU partitioning from a niche optimization to a core operational strategy. For businesses, this provides a clear method to increase the return on expensive hardware investments by running multiple, diverse models on a shared pool of GPUs. The ability to dynamically scale resources up or down based on workload demand lowers the total cost of ownership and reduces the operational complexity of manually allocating GPUs, making the deployment of sophisticated, multi-model AI services more economically viable.
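The per-GPU arithmetic behind this economic argument is straightforward. A short sketch using the headline figures reported above, with a full GPU's concurrent-user capacity normalized to 100 and two 0.5 fractions packed onto one physical GPU:

```python
# Back-of-envelope math using the benchmark figures cited in the article.
# The "per physical GPU" framing is an illustration, not a vendor claim.

FULL_GPU_USERS = 100        # normalize a dedicated full GPU to 100 users
HALF_FRACTION_USERS = 86    # a 0.5 fraction supports 86% of that (Llama 3.1 8B)
HALF_FRACTION_TPUT = 0.77   # and delivers 77% of full-GPU token throughput

# Two 0.5 fractions share one physical GPU, so aggregate per-GPU capacity is:
users_per_gpu = 2 * HALF_FRACTION_USERS   # concurrent users per physical GPU
tput_per_gpu = 2 * HALF_FRACTION_TPUT     # relative token throughput per GPU

print(f"Concurrent users per GPU: {users_per_gpu} vs. {FULL_GPU_USERS} dedicated")
print(f"Relative token throughput per GPU: {tput_per_gpu:.2f}x")
```

In other words, accepting a modest per-fraction haircut (86% of users, 77% of throughput) yields roughly 1.7x the users and 1.5x the tokens per physical GPU once two fractions are co-located, which is the latent capacity the benchmarks are pointing at.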
The benchmarks confirm that a primary constraint in scaling LLM inference is not just raw hardware availability, but intelligent resource orchestration. Dynamic GPU fractioning alters the economic calculus of production AI, enabling organizations to unlock significant latent capacity within their existing infrastructure instead of defaulting to acquiring more accelerators.