Hugging Face launches Storage Buckets for ML artifacts
Imagine a place on the Hub where you can drop everything that's "in motion": checkpoints, processed shards, logs and traces. Storage Buckets cover exactly that: mutable, S3-like storage designed for the ephemeral, high-performance artifacts that ML generates in production.
What Storage Buckets are and why they matter
A Bucket is a non-versioned container inside the Hub. It lives under your user or organization, respects Hugging Face permissions, can be private or public, and has both a web page and a programmatic address such as hf://buckets/user/my-bucket.
Why not use Git for this? Have you seen how noisy a training run gets when it writes checkpoints every few minutes? Git wasn’t built for large, mutable objects that change constantly. Buckets are designed to write fast, overwrite when needed, sync directories and remove obsolete files without fuss.
The technical advantage: Xet and chunk deduplication
Buckets are built on Xet, the chunk-based storage backend. Instead of treating every file as a monolithic blob, Xet splits content into pieces and deduplicates across them.
What does that mean for your pipelines?
- You upload a processed dataset that's very similar to the raw one, and many chunks already exist: you don't resend bytes that are already there.
- You store successive checkpoints where many layers don't change: shared chunks aren't stored again.
Result: less bandwidth, faster transfers and more efficient storage. For Enterprise customers this translates into billing on deduplicated storage, so sharing chunks reduces cost.
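The saving is easy to see with a toy model of chunk-level deduplication. This is a sketch only: Xet uses content-defined chunking rather than the fixed-size chunks below, but the accounting is the same.

```python
import hashlib

CHUNK = 64  # toy chunk size; a real chunker picks boundaries from content

def chunk_hashes(data: bytes) -> set[str]:
    """Hash fixed-size chunks so identical pieces collapse to one entry."""
    return {
        hashlib.sha256(data[i:i + CHUNK]).hexdigest()
        for i in range(0, len(data), CHUNK)
    }

# A "raw" file made of 10 distinct chunks, and a "processed" copy
# that differs in exactly one chunk.
raw = b"".join(bytes([i]) * CHUNK for i in range(10))
processed = raw[:9 * CHUNK] + b"\xff" * CHUNK

stored = chunk_hashes(raw)                     # chunks the backend already holds
to_upload = chunk_hashes(processed) - stored   # only new chunks travel
print(len(to_upload))  # 1 — nine of the ten chunks are deduplicated
```

Billing on deduplicated storage follows the same logic: the shared chunks are counted once, however many files reference them.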
Global performance and pre-warming by region
Buckets live on the Hub and by default are globally accessible. But latency matters when you run distributed training or large-scale pipelines.
The pre-warming feature lets you bring hot data closer to the provider and region where your compute runs. Instead of moving data between regions on every read, you declare where you need it and Buckets place it there before the job starts. Very handy for training clusters and multi-region setups.
Hugging Face starts with integrations for AWS and GCP. More providers will come later.
Quickstart with the CLI
You can create a Bucket in under 2 minutes with the hf CLI.
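As a sketch of what a minimal session could look like (only `hf buckets cp` and `hf buckets remove` are named in this article; the `create` subcommand and the `--private` flag are assumptions):

```shell
# Hypothetical: create a private Bucket under your namespace
hf buckets create my-training-bucket --private

# Copy a local file into the Bucket (command named in this article)
hf buckets cp ./model-final.ckpt hf://buckets/user/my-training-bucket/

# Delete an obsolete object (command named in this article)
hf buckets remove hf://buckets/user/my-training-bucket/old.ckpt
```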
There are also commands to copy single files with hf buckets cp and to clean objects with hf buckets remove.
Programmatic integration: Python, JavaScript and fsspec
If you prefer code, the huggingface_hub library has supported Buckets since v1.5.0. Integrating them into training scripts is straightforward:
```python
from huggingface_hub import create_bucket, list_bucket_tree, sync_bucket

# Create a private Bucket (no error if it already exists)
create_bucket('my-training-bucket', private=True, exist_ok=True)

# Mirror a local checkpoints directory into the Bucket
sync_bucket(
    './checkpoints',
    'hf://buckets/user/my-training-bucket/checkpoints',
)

# Walk the Bucket's contents
for item in list_bucket_tree('user/my-training-bucket', prefix='checkpoints', recursive=True):
    print(item.path, item.size)
```
For Node.js apps there’s support in @huggingface/hub since v2.10.5.
Also, Buckets work with HfFileSystem, compatible with fsspec. That means any library using fsspec can read and write directly to a Bucket. Practical examples:
```python
from huggingface_hub import hffs

# list
hffs.ls('buckets/user/my-training-bucket/checkpoints', detail=False)

# glob
hffs.glob('buckets/user/my-training-bucket/**/*.parquet')

# read
with hffs.open('buckets/user/my-training-bucket/config.yaml', 'r') as f:
    print(f.read())
```
And for Pandas or Polars:
```python
import pandas as pd

# read CSV directly from a Bucket
df = pd.read_csv('hf://buckets/user/my-training-bucket/results.csv')

# write results
df.to_csv('hf://buckets/user/my-training-bucket/summary.csv')
```
That makes it very easy to connect Buckets to existing pipelines without rewriting how you read or write files.
Good pattern: mutable layer vs versioned layer
Buckets are the place for things that are in motion. When something becomes a stable deliverable, the usual practice is to promote it to a versioned model or dataset repo on the Hub.
On the roadmap is support for direct transfers between Buckets and repos in both directions: promote final checkpoints to a model repo, or upload processed shards to a dataset repo. That way the working layer and the publication layer coexist in a continuous, native Hub workflow.
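Until direct transfers land, promotion is a download-then-upload step, and a small helper can decide which checkpoint to promote. This is a sketch under the assumption that checkpoints follow a `checkpoint-<step>` naming scheme; the repo name in the comment is made up, though `upload_file` is existing Hub API.

```python
def pick_latest_checkpoint(paths: list[str]) -> str:
    """Return the path with the highest training step, assuming
    names like 'checkpoints/checkpoint-1500.pt'."""
    def step(path: str) -> int:
        return int(path.rsplit('-', 1)[1].split('.')[0])
    return max(paths, key=step)

final = pick_latest_checkpoint([
    'checkpoints/checkpoint-500.pt',
    'checkpoints/checkpoint-1500.pt',
    'checkpoints/checkpoint-1000.pt',
])
print(final)  # checkpoints/checkpoint-1500.pt

# Then download that object from the Bucket and publish it to a versioned
# model repo, e.g. with huggingface_hub's upload_file (repo_id is hypothetical):
# HfApi().upload_file(path_or_fileobj='checkpoint-1500.pt',
#                     path_in_repo='model.pt', repo_id='user/my-model')
```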
Early experiences and adoption
Before public launch there was a private beta with partners like Jasper, Arcee, IBM and PixAI. That feedback helped improve reliability and usability.
Buckets are already included in Hub storage plans. Free accounts get space to get started, and PRO and Enterprise plans offer higher limits. For Enterprise, billing takes deduplication into account.
If you’re coming from S3, the experience will feel familiar, but with better guarantees for AI artifacts thanks to Xet and Hub integration.
Using Buckets lets you keep more of the ML lifecycle in one place: from experimentation to final publication.