Introducing Storage Buckets on the Hugging Face Hub
By Jakub Antkiewicz
2026-03-11T08:41:11Z
Hugging Face has launched Storage Buckets, a new object storage service integrated into its Hub platform. The feature addresses a long-standing challenge in machine learning operations: managing the large volume of intermediate files like model checkpoints, processed data shards, and logs. These artifacts, which are frequently modified and generated by distributed jobs, are often ill-suited for the version-controlled Git repositories traditionally used for final models and datasets. Storage Buckets provide a mutable, S3-like environment for this 'in-motion' data, aiming to streamline the development workflow within a single platform.
Underpinning the new service is Xet, Hugging Face's proprietary chunk-based storage backend. Unlike traditional file storage that treats files as monolithic blobs, Xet breaks content into smaller chunks and deduplicates them across all files in a bucket. This architecture is particularly efficient for ML workloads where successive files, such as model checkpoints, share significant amounts of unchanged data. The company states this reduces bandwidth usage, accelerates transfer speeds, and, for Enterprise customers, lowers costs by billing based on the deduplicated storage footprint. The service also includes a 'pre-warming' capability, allowing users to stage data closer to their compute resources on partners like AWS and GCP to improve data access throughput for large-scale training.
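The deduplication idea described above can be illustrated with a small toy model. The sketch below is not Hugging Face's implementation — Xet uses content-defined chunk boundaries rather than the fixed-size chunks and in-memory store assumed here — but it shows why two successive checkpoints that share most of their bytes cost far less than their combined logical size:

```python
# Toy content-addressed chunk store illustrating chunk-based deduplication.
# Assumptions (not from the article): fixed-size 64 KiB chunks, SHA-256
# chunk addressing, and an in-memory dict as the backing store.
import hashlib

CHUNK_SIZE = 64 * 1024  # toy boundary; Xet picks boundaries from content


class ChunkStore:
    """Identical chunks are stored once, no matter how many files use them."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes (each unique chunk once)
        self.files = {}   # file name -> ordered list of chunk hashes

    def put(self, name, data):
        hashes = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)  # dedup: skip already-seen chunks
            hashes.append(h)
        self.files[name] = hashes

    def get(self, name):
        # Reassemble a file from its chunk references.
        return b"".join(self.chunks[h] for h in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


# Two successive "checkpoints" that differ only in their final chunk:
step_1 = bytes(1024 * 1024)                           # 1 MiB of zeros
step_2 = step_1[:-CHUNK_SIZE] + b"\x01" * CHUNK_SIZE  # last 64 KiB changed

store = ChunkStore()
store.put("ckpt-1.bin", step_1)
store.put("ckpt-2.bin", step_2)

assert store.get("ckpt-2.bin") == step_2
# Logical size is 2 MiB (2,097,152 bytes); the store holds only the
# unique chunks: the all-zeros chunk plus the one changed chunk.
print(store.stored_bytes())  # → 131072
```

Under a billing model based on the deduplicated footprint, as described for Enterprise customers, the second checkpoint here would add only the cost of its one changed chunk.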
The introduction of Storage Buckets signals a strategic expansion for Hugging Face, moving its platform beyond a publishing destination to become a more comprehensive hub for the entire ML lifecycle. By offering a native solution for the messy, high-throughput storage needs of active development, the company is positioning itself to capture workflows that currently rely on separate cloud storage services. This creates a more cohesive path for developers, from raw data and iterative experimentation in Buckets to a polished, versioned artifact in a model or dataset repo. The feature was privately tested with launch partners including Jasper, Arcee, IBM, and PixAI, indicating enterprise-level interest in a more integrated MLOps toolchain.
The takeaway: by supplying a native storage layer for the intermediate stages of development, Hugging Face is moving to centralize the entire MLOps workflow on its platform, competing directly with generic cloud object storage for AI workloads.