AiPhreaks

Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling

By Jakub Antkiewicz

May 8, 2026

Slurm Block Scheduling Released to Tame NVIDIA's Rack-Scale GB200 NVL72

In a joint effort to address the unique architectural demands of rack-scale GPU computing, NVIDIA and SchedMD have introduced a critical update to the Slurm workload manager. The `topology/block` plugin, which debuted in Slurm 23.11 and has been enhanced in subsequent releases, is designed specifically for systems like the NVIDIA GB200 NVL72. This matters because the GB200 NVL72 extends a coherent NVLink memory domain across an entire rack, creating a performance cliff for any workload that crosses this boundary. The plugin replaces best-effort scheduling with rigid, topology-aware job placement, preventing the performance degradation and system fragmentation that would otherwise hinder these exascale systems.

The technical challenge stems from the GB200 NVL72's design, which unifies 72 Blackwell GPUs across a rack with fifth-generation NVLink. While communication within this domain is exceptionally fast, traffic crossing into standard InfiniBand or Ethernet fabrics experiences a sharp drop in speed. The `topology/block` plugin addresses this by treating each NVLink domain as an atomic scheduling block. Administrators can define these blocks in a `topology.yaml` file, and users can leverage the `--segment` argument to specify their application's precise locality needs. This allows jobs to be split across multiple blocks if the workload permits, balancing strict performance requirements with scheduler efficiency.
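As a rough illustration, a block layout for two NVL72 racks (18 compute nodes per rack) might be declared as follows. This is a sketch, not a drop-in configuration: the node names and segment sizes are hypothetical, and the exact schema varies by Slurm release (the classic `topology.conf` syntax is shown here; newer releases also accept a YAML form).

```
# topology.conf sketch (hypothetical node names)
# Each NVL72 rack is one atomic block of 18 nodes.
BlockName=rack1 Nodes=nvl-r1-n[01-18]
BlockName=rack2 Nodes=nvl-r2-n[01-18]
# Valid placement sizes: one block, or contiguous groups of blocks.
BlockSizes=18,36
```

A user could then request, say, 36 nodes placed as two whole-block segments with something like `sbatch --nodes=36 --segment=18 job.sh`, telling the scheduler that each 18-node segment must land entirely inside one NVLink domain.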

  • System: NVIDIA GB200 NVL72
  • GPU Count: 72 Blackwell GPUs per rack
  • Interconnect: Fifth-generation NVLink
  • Aggregate Bandwidth: 130 TB/s within the NVLink domain
  • Scheduler Plugin: Slurm `topology/block`

This development directly impacts how large-scale AI clusters are managed, moving the industry toward a model where hardware architecture and software orchestration are tightly co-designed. For data center operators, it provides the necessary tools to move from prototype clusters to production-grade environments, ensuring consistent, high-performance operation. By integrating features like support for incomplete blocks and driver-level GPU isolation via the NVIDIA IMEX plugin, Slurm now offers a robust solution for managing the complexity of these powerful systems. This shift ensures that the theoretical performance of rack-scale hardware can be reliably achieved in practice, preventing resource fragmentation and maximizing workload throughput.
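The placement policy described above (treat each NVLink domain as atomic, allocate jobs in whole segments, and never let a segment straddle a block boundary) can be sketched in a few lines of Python. This is a toy model for illustration only, not Slurm's actual algorithm; the block sizes and rack names are assumptions.

```python
def place_job(free_per_block, nodes, segment):
    """Toy block-aware placement.

    free_per_block: {block_id: free node count}, one entry per NVLink domain.
    Returns {block_id: nodes_taken}, or None if the job cannot be placed.
    """
    if nodes % segment != 0:
        return None  # a job must decompose into whole segments
    segments_needed = nodes // segment
    placement = {}
    for block_id, free in free_per_block.items():
        # Only whole segments fit in a block; a partial block's leftover
        # nodes cannot host a fraction of a segment.
        usable_segments = free // segment
        take = min(usable_segments, segments_needed)
        if take:
            placement[block_id] = take * segment
            segments_needed -= take
        if segments_needed == 0:
            return placement
    # Free nodes may exist, but not as whole segments: fragmentation.
    return None

# Two 18-node racks, the second partially occupied.
print(place_job({"rack1": 18, "rack2": 18}, 36, 18))  # {'rack1': 18, 'rack2': 18}
print(place_job({"rack1": 18, "rack2": 12}, 36, 18))  # None: rack2 too fragmented
```

The second call is the interesting case: 30 free nodes exist in total, but the job is refused because no second whole 18-node segment fits inside a single block, which is exactly the hard boundary the plugin enforces.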

The introduction of rigid block scheduling for the GB200 NVL72 shows that maximizing exascale performance is now less about raw interconnect speed and more about intelligent, topology-aware workload orchestration that respects hard hardware boundaries. The era of treating the network fabric as a best-effort resource is ending for high-performance AI.