AiPhreaks

Running Large-Scale GPU Workloads on Kubernetes with Slurm

By Jakub Antkiewicz

April 10, 2026

NVIDIA is now running production Slurm clusters with over 8,000 GPUs on Kubernetes using its open-source Slinky project, bridging the gap between traditional high-performance computing (HPC) and cloud-native infrastructure. The project originated at SchedMD, the developer of Slurm, which NVIDIA has since acquired, and it provides a way to manage the dominant HPC job scheduler directly within Kubernetes. This addresses a significant operational challenge for organizations that are invested in Slurm workflows but also want to standardize on Kubernetes for managing large-scale GPU resources.

The core of the implementation is the Slinky `slurm-operator`, which represents each Slurm component (the control daemon `slurmctld`, the accounting daemon `slurmdbd`, and the worker daemon `slurmd`) as a native Kubernetes Custom Resource. This allows platform teams to deploy and manage a full Slurm cluster using familiar declarative YAML configurations and Helm charts. The system integrates directly with the NVIDIA GPU Operator to automate GPU driver management and provides per-job telemetry. For advanced multi-node architectures like the GB200 NVL72, Slinky uses Kubernetes abstractions like ComputeDomains to dynamically manage NVLink connectivity, enabling topology-aware scheduling for distributed training jobs.
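To make the declarative model concrete, a worker pool in this style might be described with a manifest along the following lines. This is an illustrative sketch only: the API group, resource kind, image reference, and field names below are assumptions for the sake of example, not Slinky's exact schema, so consult the `slurm-operator` documentation for the real Custom Resource definitions.

```yaml
# Hypothetical sketch of a Slurm worker pool as a Kubernetes Custom Resource.
# All names here (apiVersion, kind, image) are illustrative assumptions.
apiVersion: slinky.slurm.net/v1alpha1   # assumed API group/version
kind: NodeSet                           # assumed resource kind for slurmd workers
metadata:
  name: gpu-workers
spec:
  replicas: 64                          # number of slurmd worker pods
  template:
    spec:
      containers:
        - name: slurmd
          image: ghcr.io/slinkyproject/slurmd:latest  # assumed image location
          resources:
            limits:
              nvidia.com/gpu: 8         # GPUs exposed via the NVIDIA GPU Operator
```

The point of the sketch is the workflow it implies: scaling the worker pool, pinning a Slurm version, or changing GPU counts becomes an edit to a YAML spec that the operator reconciles, rather than a manual change on every node.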

Moving Slurm onto Kubernetes substantially reduces the operational burden of managing two separate ecosystems. According to NVIDIA's production experience, this model achieves performance parity with non-containerized Slurm deployments. More importantly, it allows for unified monitoring via Prometheus and Grafana, non-disruptive rolling updates for Slurm versions or OS patches, and coordinated maintenance using standard `kubectl` commands. This consolidation simplifies operations and allows engineering teams to manage complex AI training infrastructure with a single, consistent set of tools.
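In practice, day-two operations collapse into standard Kubernetes workflows. The commands below are a hedged sketch of what that looks like; the `kubectl` and `helm` invocations are standard, but the release name, namespace, chart location, and workload names are assumptions for illustration and will differ in a real deployment.

```shell
# Inspect the Slurm control and worker pods like any other workload
# (the "slurm" namespace is an assumed deployment choice).
kubectl get pods -n slurm

# Roll out a new Slurm version by upgrading the Helm release;
# the chart reference and value name here are illustrative, not official.
helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --namespace slurm --set slurmVersion=25.05

# Watch the rolling update proceed without tearing down the cluster
# (workload name assumed).
kubectl rollout status statefulset/slurm-controller -n slurm

# Coordinate node maintenance with standard scheduling controls.
kubectl cordon gpu-node-17
kubectl uncordon gpu-node-17
```

The design point is that none of these steps are Slurm-specific: the same upgrade, rollout, and cordon machinery a platform team already uses for its other Kubernetes workloads now covers the HPC scheduler as well.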

NVIDIA's internal adoption of Slinky to manage thousands of production GPUs signals a deliberate strategy to standardize large-scale AI infrastructure on Kubernetes. By treating the industry-standard Slurm scheduler as a native Kubernetes application, the company is providing a practical migration path for established HPC environments and cementing Kubernetes as the unified operational layer for enterprise AI.