Running Large-Scale GPU Workloads on Kubernetes with Slurm
By Jakub Antkiewicz
2026-04-14T09:17:47Z
NVIDIA has detailed its use of Slinky, an open-source project that runs the Slurm job scheduler natively on Kubernetes, to manage large-scale AI training clusters with over 8,000 GPUs. The effort, centered on the Slinky slurm-operator, addresses a critical challenge for organizations with deep investments in traditional high-performance computing (HPC) workflows: how to leverage the operational benefits of cloud-native infrastructure without abandoning established Slurm-based systems. By containerizing Slurm components and managing them with Kubernetes, NVIDIA is demonstrating a production-ready path for unifying two historically separate technology stacks.
Technically, the slurm-operator represents each Slurm daemon as a Kubernetes custom resource, with its schema defined by a Custom Resource Definition (CRD), so the entire Slurm cluster lifecycle can be managed declaratively. The integration extends to the broader cloud-native ecosystem: it uses the NVIDIA GPU Operator for automated driver and telemetry setup, and Prometheus for unified observability. For advanced hardware like the GB200 NVL72, Slinky uses Kubernetes abstractions such as Dynamic Resource Allocation (DRA) and ComputeDomains to manage multinode NVLink topologies, so that distributed training jobs can achieve full interconnect bandwidth. The recent v1.1.0 release adds dynamic topology discovery and more robust worker node management.
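To make "managed declaratively" concrete, here is a minimal Python sketch of the reconcile pattern that Kubernetes operators generally follow: compare a desired specification with the observed state and derive the actions that close the gap. It is an illustration only, not the slurm-operator's code. The `NodeSetSpec` type, its fields, the pod naming scheme, and the `plan_actions` helper are all hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSetSpec:
    """Hypothetical desired state for a group of Slurm worker pods."""
    name: str      # prefix used for pod names (assumed naming scheme)
    replicas: int  # how many worker pods should exist
    image: str     # container image every worker should run


def plan_actions(desired: NodeSetSpec, observed: dict[str, str]) -> list[str]:
    """Return the actions needed to move `observed` toward `desired`.

    `observed` maps existing pod names to the image each one is running.
    """
    wanted = [f"{desired.name}-{i}" for i in range(desired.replicas)]
    actions: list[str] = []

    # Create pods that are missing and update pods running the wrong image.
    for pod in wanted:
        if pod not in observed:
            actions.append(f"create {pod}")
        elif observed[pod] != desired.image:
            actions.append(f"update {pod}")

    # Delete pods that exist but are no longer part of the desired set.
    for pod in sorted(observed):
        if pod not in wanted:
            actions.append(f"delete {pod}")

    return actions


if __name__ == "__main__":
    spec = NodeSetSpec(name="gpu", replicas=3, image="slurmd:2")
    running = {"gpu-0": "slurmd:2", "gpu-1": "slurmd:1", "gpu-5": "slurmd:2"}
    print(plan_actions(spec, running))
```

The point of the pattern is that the user edits only the specification. The controller reruns the comparison whenever the desired or observed state changes, so the same logic covers scaling, upgrades, and cleanup.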
This approach signals a significant move toward consolidating AI infrastructure management. By bringing Slurm into the Kubernetes fold, platform teams can use familiar tools like kubectl, Helm, and Grafana to operate HPC workloads, eliminating the need for separate management toolchains. The reported benefits at NVIDIA, including nondisruptive rolling updates, automated remediation of unhealthy nodes, and performance parity with non-containerized deployments, provide a compelling model for other large-scale AI operators. The approach lowers the barrier for traditional HPC shops to adopt modern infrastructure practices while preserving their existing scheduling logic and user workflows.
The Slinky project provides a strategic bridge for modernizing AI infrastructure, allowing organizations to retain years of investment in Slurm workflows while gaining the operational efficiencies and ecosystem benefits of Kubernetes. It treats HPC scheduling not as a legacy system to be replaced, but as a first-class workload within the cloud-native paradigm.