OpenAI released the GPT‑OSS family, and Hugging Face updated transformers with several optimizations that make loading and running large models much faster and cheaper to operate. The announcement post, published on September 11, 2025, summarizes practical changes anyone can try today. (huggingface.co)
What's in the update
It’s not just a patch. It’s a collection of carefully thought-out improvements to reduce memory, speed up inference, and make it easier to split work across GPUs. Sound like something only data centers should care about? Think again: many of these upgrades work on consumer GPUs and environments like Colab.
- Kernels downloadable from the Hub: transformers can now fetch precompiled binaries (kernels) that run critical operations close to the metal, avoiding local compilation and reducing dependency pain. You enable this with use_kernels=True or let automatic kernels kick in when appropriate. (huggingface.co)
- Native support for MXFP4: a 4-bit quantization format designed to preserve dynamic range using blocks and scales, which lets you run models like GPT‑OSS 20B in around 16 GB of VRAM and much larger models with a fraction of the previous memory. That makes it possible to run big models on a single GPU where they wouldn't fit before (see the quantization sketch after this list). (huggingface.co)
- Better integrated parallelism (Tensor and Expert Parallelism): transformers adds tp_plan="auto" and MoE (Mixture of Experts) support that make it easier to shard both tensors and experts across processes and GPUs, improving throughput on heavy workloads. No cluster? There are paths to run on local processes with torchrun too (see the torchrun sketch after this list). (huggingface.co)
- Dynamic cache and sliding window: for models that use sliding window attention, the KV cache no longer grows forever; it trims to the real window size, cutting memory use and latency on long prompts. Super handy when you mix global and local layers. (huggingface.co)
- Continuous batching and faster loading: support for generate_batch (continuous batching) reuses empty slots in batches and improves efficiency when requests have different lengths; plus, transformers now preallocates GPU memory blocks before copying weights, speeding up loading of large models (see the batching sketch after this list). (huggingface.co)
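A quantization sketch for the MXFP4 point. GPT‑OSS checkpoints already ship with MXFP4 weights, so in most cases you just load the model; the Mxfp4Config class used below is not named in the post, and treating dequantize=True as the explicit "give me bfloat16 instead" switch is an assumption to confirm against the transformers docs for your version:

from transformers import AutoModelForCausalLM, Mxfp4Config

model_id = "openai/gpt-oss-20b"

# If triton and the kernels package are installed, the checkpoint stays
# quantized in MXFP4 and fits in roughly 16 GB of VRAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
)

# Assumption: Mxfp4Config(dequantize=True) upcasts the weights to bfloat16
# instead, trading memory for compatibility on older GPUs.
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     quantization_config=Mxfp4Config(dequantize=True),
#     device_map="auto",
# )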
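For the parallelism point, a torchrun sketch. It assumes a multi-GPU host and that tp_plan="auto" picks a sharding plan for the checkpoint, as the post describes; launch it with torchrun so each process owns one GPU:

# tp_demo.py, launch with: torchrun --nproc-per-node 4 tp_demo.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# tp_plan="auto" shards the weights across the processes started by torchrun.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    tp_plan="auto",
)

inputs = tokenizer("Tensor parallelism in a nutshell:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))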
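And a batching sketch for the last point. The generate_batch name comes from the post, but the call signature (pre-tokenized prompts plus a GenerationConfig) and the shape of the result are assumptions; check the continuous batching examples in the transformers repo for the exact API:

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

prompts = [
    "Explain MXFP4 in one sentence.",
    "What is continuous batching?",
    "Write a haiku about GPUs.",
]

# Assumption: generate_batch takes pre-tokenized prompts of different lengths
# and schedules them continuously, reusing slots as sequences finish.
batch_inputs = [tokenizer(p).input_ids for p in prompts]
generation_config = GenerationConfig(max_new_tokens=64, do_sample=False)
outputs = model.generate_batch(inputs=batch_inputs, generation_config=generation_config)

# Assumption about the return structure: a mapping from request id to a result
# exposing the generated token ids.
for request_id in outputs:
    print(request_id, tokenizer.decode(outputs[request_id].generated_tokens, skip_special_tokens=True))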
What this means for you
If you’re a developer or researcher: less time waiting for models to load, more time testing ideas. With MXFP4 and Hub kernels you can experiment on a single GPU with models that used to need clusters. Want to fine-tune or save a quantized model? The workflow is already integrated.
If you’re a product engineer or founder: lower operating costs and nimbler deployments. Less memory means cheaper instances or the option to offer larger models to end users without multiplying infrastructure spend.
If you’re starting out: you don’t need to compile exotic drivers or wrestle with toolchains. Many optimizations come by default or via simple options like device_map="auto", use_kernels=True or tp_plan="auto". Here’s a minimal example to try gpt-oss-20b:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
    use_kernels=True,
)
Try measuring load times and VRAM with and without these options; the difference is often clear.
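A simple way to do that measurement, using only the standard library and torch's memory counters (numbers will vary by GPU and driver):

import time
import torch
from transformers import AutoModelForCausalLM

model_id = "openai/gpt-oss-20b"

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
    use_kernels=True,  # flip to False to compare
)
print(f"load time: {time.perf_counter() - start:.1f}s")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")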
Risks and practical limitations
Nothing magical: MXFP4 requires triton>=3.4, kernels and accelerate, and GPUs with compute capability >= 7.5 to get the best results. If your environment doesn’t meet those requirements, transformers falls back to bfloat16 by default, which uses more memory. Also, some kernels are architecture-specific (for example, Flash Attention 3 with attention-sink support on Hopper hardware). It’s always wise to benchmark on your setup. (huggingface.co)
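A quick way to check whether your GPU clears that bar, using torch's standard device query:

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: {major}.{minor}")
    print("MXFP4-friendly" if (major, minor) >= (7, 5) else "expect a bfloat16 fallback")
else:
    print("No CUDA GPU detected")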
How to start today (quick)
- Update transformers, torch and accelerate (a quick version check is sketched after this list).
- Try loading a model with device_map="auto" and use_kernels=True to see logs confirming which kernels are used.
- If your GPU meets the requirements, enable MXFP4 and compare VRAM usage; run hf cache scan to check which kernels were downloaded.
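For that first step, a minimal sketch that reports what is currently installed before you update (package names taken from the requirements above):

from importlib.metadata import PackageNotFoundError, version

for pkg in ("transformers", "torch", "accelerate", "kernels", "triton"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")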
Sound like a lot to do? Start with a small example (e.g., GPT‑OSS 20B in Colab) and scale up as you see how cost and latency change.
Final reflection
This isn’t optimization for hobbyists only: these changes lower the barrier for large models to be useful in real environments, from notebook prototypes to production services. The takeaway? The community is shifting infrastructure toward more practical, reusable solutions: downloadable kernels, common quantization formats, and parallelism plans that used to need heavy engineering. If you want to experiment, you now have more direct tools to try and compare results in your own setup.
Summary: Hugging Face updated transformers to use kernels from the Hub, MXFP4 (4-bit), parallelism (tensor and experts), dynamic cache and continuous batching, all aimed at reducing memory and speeding up large models. The September 11, 2025 post shows how these improvements let you run massive models with lower cost and infrastructure impact. (huggingface.co)