For years, loading terabyte-scale data has been the most frustrating bottleneck in any training pipeline. Sound familiar — waiting hours before you can start a run, or watching workers crash because storage can’t keep up? Hugging Face just shipped a big upgrade to datasets and huggingface_hub that changes that picture: more efficient, safer streaming built to train on multi‑TB datasets without downloading anything.
What they achieved and why it matters
In practical terms: you still call load_dataset(..., streaming=True) as before, but the backend now resolves and streams files far more efficiently. The concrete results: up to 100× fewer startup requests, 10× faster file resolution, up to 2× higher streaming throughput, and fewer in-flight requests during training. In tests on 64×H100 GPUs with 256 data workers, streaming even beat reading from local SSDs.
This isn’t just convenience: it cuts latency, avoids overloads that cause 429s, and removes those hours of prep before a run. For teams with modest clusters or limited storage budgets, it means you can start training sooner and with less friction.
What changed technically?
They worked on two key phases: startup (listing and resolving files) and streaming (throughput during training).
- Startup: they implemented a persistent file-list cache shared across all DataLoader workers. The first worker queries the Hub; the others read the local cache. Result: no more request storms when every worker tries to resolve the repo on its own.
- Resolution optimization: getting the file list is now batched and minimizes API calls, reducing latency in that initial step.
- Streaming: they added prefetching for Parquet and configurable buffering. While your model processes the current chunk, the library is already fetching the next one. You can also tweak parameters like the minimum range size and the number of blocks to prefetch, so you can tune the pipeline to your hardware and network.
Quick technical example (Parquet + pyarrow)
To change the minimum range size from 32MiB to 128MiB and enable prefetch:
from datasets import load_dataset
import pyarrow
import pyarrow.dataset

# parquet_dataset_id is a placeholder for your dataset repo id on the Hub.
fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
    cache_options=pyarrow.CacheOptions(
        prefetch_limit=1,            # fetch the next range while the current one is decoded
        range_size_limit=128 << 20,  # 128 MiB ranges instead of the default 32 MiB
    ),
)
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
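In this setup, range_size_limit controls how large each fetched byte range is (bigger ranges mean fewer, larger requests at the cost of more memory per worker), while prefetch_limit sets how many ranges are read ahead of the one currently being decoded.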
They also improved HfFileSystem in huggingface_hub so you can read remote files with standard open/read/seek semantics and reuse .ls() and .glob() results across workers.
from huggingface_hub import HfFileSystem

# dataset_id and path_in_repo are placeholders for your repo and file.
path = f"hf://datasets/{dataset_id}/{path_in_repo}"

with HfFileSystem().open(path) as f:
    data = f.read()  # or f.readline(), or f.seek() for random access
Passing the same HfFileSystem instance to your DataLoader workers means the cached listings travel with it, so redundant calls are eliminated; the sketch below shows the pattern.
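Here is a minimal sketch of that pattern (not from the announcement: the JsonlShards class, the repo path, and the shard layout are hypothetical, and it assumes PyTorch is installed):

from torch.utils.data import DataLoader, IterableDataset, get_worker_info
from huggingface_hub import HfFileSystem

class JsonlShards(IterableDataset):  # hypothetical example class
    def __init__(self, repo_path):
        self.fs = HfFileSystem()
        # Listing happens once; the result is cached inside the filesystem object
        # and travels with it when the DataLoader spawns its workers.
        self.files = sorted(self.fs.glob(f"{repo_path}/*.jsonl"))

    def __iter__(self):
        info = get_worker_info()
        # Give each worker its own slice of shards so nothing is read twice.
        files = self.files if info is None else self.files[info.id::info.num_workers]
        for path in files:
            with self.fs.open(path) as f:  # streams bytes from the Hub
                for line in f:
                    yield line

loader = DataLoader(JsonlShards("datasets/my-org/my-text-corpus"), num_workers=4, batch_size=None)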
Other important pieces: Xet and Parquet CDC
Hugging Face uses Xet, a deduplicating storage backend that speeds up uploads and downloads because duplicate data is never retransmitted. For Parquet they implemented Content-Defined Chunking (CDC), which improves deduplication and makes uploading and streaming large datasets faster than with traditional remote stores.
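To make the idea concrete, here is a toy content-defined chunker (a conceptual illustration only, not Xet's actual algorithm): chunk boundaries come from the bytes themselves via a rolling hash, so inserting data only shifts nearby boundaries and the untouched chunks still deduplicate.

def cdc_chunks(data: bytes, mask: int = 0xFFF):
    """Toy content-defined chunker (~4 KiB average chunks with this mask)."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF  # toy rolling hash
        if (rolling & mask) == 0:                       # boundary: low bits are zero
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                     # trailing chunk
    return chunks

# Editing bytes near the start of a file only changes the first chunk or two;
# downstream boundaries resynchronize, so identical chunks are never re-uploaded.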
There are also integrations with pyspark_huggingface to speed transfers from Spark and support for custom pipelines (for example, WebDataset or frame sampling in video with LeRobot).
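For the Spark route, the general pattern looks like this (a sketch based on the pyspark_huggingface project; the exact registration step and options may differ by Spark version, and the repo id is a placeholder):

import pyspark_huggingface  # enables the "huggingface" data source
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hf-streaming-demo").getOrCreate()

# Read a Hub dataset straight into a Spark DataFrame.
df = spark.read.format("huggingface").load("username/dataset_name")
df.show(5)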
How to take advantage of it today
- Update your libraries:
pip install --upgrade datasets huggingface_hub
- Use streaming=True as always:
from datasets import load_dataset
dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)
print(next(iter(dataset)))
- If you want to squeeze out more performance, tweak fragment_scan_options (or equivalent) to increase range_size_limit and prefetch_limit, and experiment with worker count and batch size to balance CPU, network, and GPU, as in the sketch below.
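A minimal sketch of that last point (assumes PyTorch is installed and uses a pass-through collate_fn, since multimodal samples usually need custom batching; the numbers are starting points, not recommendations):

from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,                    # raise until the GPU stops waiting on data
    prefetch_factor=4,                # batches each worker fetches ahead
    persistent_workers=True,          # keep workers (and their caches) alive between epochs
    collate_fn=lambda batch: batch,   # pass-through; swap in your own collation
)

for batch in loader:
    break  # replace with your training step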
Practical results and use cases
- Train VLMs (like nanoVLM or SmolVLM) without routing through S3 or downloading terabytes to your disks.
- Faster experimentation: inspect datasets and run small tests without long waits.
- Teams with many concurrent workers avoid overloads and crashes from too many requests.
In practice, Hugging Face reports that with these changes, streaming can match or beat local SSD reads on their training clusters: a real shift for teams that used to stage data on disk before every run.
Final thought
It’s not always about bigger models; sometimes the biggest win comes from removing friction at the data input. These improvements show that optimizing I/O and worker coordination gives performance jumps you feel in real time: less waiting, fewer errors, and more iteration time on models and data.
If you work with large datasets, update your libs, try streaming, and tweak the buffering. The surprise? Your GPU might spend more time training and less time waiting for data.
