How does Dynamo Snapshot work on Kubernetes without modifying a container runtime such as runc?

It uses a privileged DaemonSet called `snapshot-agent` that runs on every node. The agent orchestrates checkpoint and restore from the host side, invoking CRIU and cuda-checkpoint directly. As a result, the system is portable and does not depend on any cloud provider's or runtime's own checkpoint/restore support.

NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes

NVIDIA Targets AI Cold-Start Latency with Dynamo Snapshot

NVIDIA has introduced Dynamo Snapshot, a system designed to reduce the "cold-start" latency that is common when scaling AI inference workloads on Kubernetes. It enables near-instant checkpoint and restore of single-GPU applications by serializing both the host process state and the GPU device state. The goal is more responsive elastic scaling: operators can provision new inference replicas quickly enough to meet service-level agreements during sudden traffic spikes, instead of leaving expensive GPUs idle for several minutes during a traditional startup sequence.

Technical Deep Dive: CRIU and CUDA Checkpoints

Dynamo Snapshot builds on two tools: CRIU (Checkpoint/Restore in Userspace), which captures the CPU-side process state, and cuda-checkpoint, which serializes the GPU-side device state. On Kubernetes it is deployed as a privileged `snapshot-agent` DaemonSet that performs checkpoint and restore without modifying the underlying `runc` container runtime, which keeps it portable across cloud environments. Quiesce/resume hooks let the inference workload release resources that cannot be checkpointed, such as active network connections, and reach a quiescent state before the snapshot is taken. This reduces the final checkpoint size and allows the workload to be restored cleanly.
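To make the two-tool split concrete, here is a minimal sketch of the order in which the two tools are typically invoked when checkpointing a CUDA process by hand. It only builds the command lines and does not run them. The flags follow the publicly documented manual workflow for `cuda-checkpoint` and `criu`; how `snapshot-agent` actually drives these tools inside a container is not described in the source, so the function names, the PID, and the image directory below are illustrative assumptions.

```python
import shlex
from typing import List


def checkpoint_commands(pid: int, images_dir: str) -> List[List[str]]:
    """Build (not run) the commands for a manual checkpoint of one CUDA process."""
    return [
        # 1. Suspend CUDA: device state is moved into host memory and the GPU is released.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # 2. Dump the now CPU-only process tree to image files on disk.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir, "--tree", str(pid)],
    ]


def restore_commands(pid: int, images_dir: str) -> List[List[str]]:
    """Build (not run) the commands for the matching restore."""
    return [
        # 1. Recreate the process tree from the images (CRIU restores the original PID).
        ["criu", "restore", "--shell-job", "--restore-detached", "--images-dir", images_dir],
        # 2. Resume CUDA: device state is copied back onto the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


def render(commands: List[List[str]]) -> List[str]:
    """Render argument lists as shell-safe strings for logging or review."""
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    # Hypothetical PID and directory, for illustration only.
    for line in render(checkpoint_commands(1234, "/tmp/ckpt")):
        print(line)
    for line in render(restore_commands(1234, "/tmp/ckpt")):
        print(line)
```

The ordering is the point: GPU state has to be folded into host memory before CRIU dumps the process, and it is pushed back to the device only after CRIU has restored the process.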

Core Mechanism: Combines CRIU for host state and cuda-checkpoint for GPU device state.
Kubernetes Integration: Deploys as a portable `snapshot-agent` DaemonSet, independent of runtime-specific feature gates.
Workload Coordination: Uses quiesce/resume hooks to ensure a clean, optimized state for checkpointing.
Key Optimizations: Employs KV cache unmapping, parallel memory restore, and asynchronous I/O to accelerate the process.
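The workload coordination point can be illustrated with a toy model of a quiesce/resume hook pair. This is a sketch under stated assumptions, not Dynamo Snapshot's actual hook interface: the class `InferenceWorker`, its fields, and the endpoint string are all hypothetical, and a `bytearray` stands in for GPU memory.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InferenceWorker:
    """Toy worker; all names here are hypothetical stand-ins."""
    weights_bytes: int
    kv_cache_bytes: int
    endpoint: str = "tcp://router.example:9000"  # hypothetical peer address
    connections: List[str] = field(default_factory=list)
    kv_cache: Optional[bytearray] = None
    quiesced: bool = False

    def start(self) -> None:
        # Normal startup: allocate the KV cache and connect to peers.
        self.kv_cache = bytearray(self.kv_cache_bytes)
        self.connections.append(self.endpoint)

    def quiesce(self) -> None:
        # Called before the snapshot: drop what cannot or need not be saved.
        self.connections.clear()   # live sockets are not checkpointable
        self.kv_cache = None       # cache contents are rebuilt after restore
        self.quiesced = True

    def resume(self) -> None:
        # Called after restore: re-create the resources released in quiesce().
        self.kv_cache = bytearray(self.kv_cache_bytes)
        self.connections.append(self.endpoint)
        self.quiesced = False

    def snapshot_bytes(self) -> int:
        # State that would end up in a checkpoint taken right now.
        cache = len(self.kv_cache) if self.kv_cache is not None else 0
        return self.weights_bytes + cache


if __name__ == "__main__":
    worker = InferenceWorker(weights_bytes=4096, kv_cache_bytes=65536)
    worker.start()
    before = worker.snapshot_bytes()
    worker.quiesce()
    after = worker.snapshot_bytes()
    worker.resume()
    print(before, after, worker.snapshot_bytes())
```

In this model the snapshot taken between `quiesce()` and `resume()` contains only the weights, which mirrors why releasing the KV cache before checkpointing shrinks the artifact.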

Performance Impact and Roadmap

NVIDIA reports substantial performance gains. A primary optimization is deallocating the KV cache memory before checkpointing, which in one example reduced a model's checkpoint artifact from ~190 GiB to 6 GiB. Forthcoming modifications to CRIU that parallelize memory restoration are expected to reduce restore times further. Early experimental results show up to a 21x reduction in startup time on large models such as `gpt-oss-120b`, bringing restore times closer to ideal speeds. The roadmap includes support for multi-GPU and multi-node environments and integration with frameworks such as TensorRT-LLM.
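As a quick worked check of the checkpoint-size figures, the following sketch computes the reduction implied by going from ~190 GiB to 6 GiB, plus the time needed to read each artifact at a given storage bandwidth. The sizes come from the text; the 2 GiB/s bandwidth is a purely hypothetical value chosen for illustration and is not a figure reported by NVIDIA.

```python
from typing import Tuple


def size_reduction(before_gib: float, after_gib: float) -> Tuple[float, float]:
    """Return (shrink factor, percent saved) for a checkpoint artifact."""
    factor = before_gib / after_gib
    percent_saved = (1.0 - after_gib / before_gib) * 100.0
    return factor, percent_saved


def read_seconds(size_gib: float, bandwidth_gib_per_s: float) -> float:
    """Lower bound on the time to read an artifact at a given sequential bandwidth."""
    return size_gib / bandwidth_gib_per_s


if __name__ == "__main__":
    factor, saved = size_reduction(190.0, 6.0)
    print(f"{factor:.1f}x smaller, {saved:.1f}% saved")  # 31.7x smaller, 96.8% saved

    assumed_bandwidth = 2.0  # GiB/s, hypothetical storage throughput
    print(read_seconds(190.0, assumed_bandwidth))  # 95.0
    print(read_seconds(6.0, assumed_bandwidth))    # 3.0
```

Under that assumed bandwidth, reading the artifact alone would drop from about 95 seconds to about 3 seconds, which shows why the size reduction matters independently of any restore-side parallelism.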

Strategic Takeaway: By abstracting away the complex state management of GPU workloads, NVIDIA's Dynamo Snapshot addresses a critical operational bottleneck, potentially making on-demand, cost-effective scaling of large inference models a practical reality for more enterprises.
