Transformers gain speed: MXFP4, kernels and parallelism

OpenAI released the GPT‑OSS family, and Hugging Face updated transformers with several optimizations that make loading and running large models much faster and cheaper to operate. The blog post was published on September 11, 2025 and summarizes practical changes anyone can try today. (huggingface.co)

What's in the update

It’s not just a patch. It’s a collection of carefully thought-out improvements to reduce memory, speed up inference, and make it easier to split work across GPUs. Sound like something only data centers should care about? Think again: many of these upgrades work on consumer GPUs and environments like Colab.

  • Kernels downloadable from the Hub: transformers can now fetch precompiled binaries (kernels) that run critical operations close to the metal, avoiding local compilation and reducing dependency pain. You enable this with use_kernels=True or let automatic kernels kick in when appropriate. (huggingface.co)

  • Native support for MXFP4: a 4-bit quantization format that preserves dynamic range by sharing scales across small blocks of values, which lets you run models like GPT‑OSS 20B in around 16 GB of VRAM and much larger models in a fraction of the memory they used to need. That makes it possible to run big models on a single GPU where they wouldn’t fit before. (huggingface.co)

  • Better integrated parallelism (Tensor and Expert Parallelism): transformers adds tp_plan="auto" and MoE (Mixture of Experts) support that make it easier to shard both tensors and experts across processes and GPUs, improving throughput on heavy workloads (see the sketch after this list). No cluster? There are paths to run on local processes with torchrun too. (huggingface.co)

  • Dynamic cache and sliding window: for models that use sliding window attention, the KV cache no longer grows forever — it trims to the real window size, cutting memory use and latency on long prompts. Super handy when you mix global and local layers. (huggingface.co)

  • Continuous batching and faster loading: support for generate_batch (continuous batching) reuses empty slots in batches and improves efficiency when requests have different lengths; plus, transformers now preallocates GPU memory blocks before copying weights, speeding up loading of large models. (huggingface.co)
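
To make the parallelism item above concrete, here is a minimal tensor-parallel loading sketch, assuming a multi-GPU machine with recent transformers and torch; the script name tp_demo.py is just a placeholder, not something from the post.

import torch
from transformers import AutoModelForCausalLM

# Launch this script with torchrun so each process gets one GPU, e.g.:
#   torchrun --nproc-per-node 4 tp_demo.py
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    dtype=torch.bfloat16,
    tp_plan="auto",  # let transformers pick a tensor-parallel sharding plan
)
print(f"This rank holds its shard on {model.device}")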

What this means for you

If you’re a developer or researcher: less time waiting for models to load, more time testing ideas. With MXFP4 and Hub kernels you can experiment on a single GPU with models that used to need clusters. Want to fine-tune or save a quantized model? The workflow is already integrated.

If you’re a product engineer or founder: lower operating costs and nimbler deployments. Less memory means cheaper instances or the option to offer larger models to end users without multiplying infrastructure spend.

If you’re starting out: you don’t need to compile exotic drivers or wrestle with toolchains. Many optimizations come by default or via simple options like device_map="auto", use_kernels=True or tp_plan="auto". Here’s a minimal example to try gpt-oss-20b:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",        # use the dtype stored in the checkpoint config
    device_map="auto",   # place weights across available GPUs (and CPU if needed)
    use_kernels=True,    # fetch precompiled kernels from the Hub
)
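
Once it loads, a short generation call makes a handy smoke test. This is a sketch using the standard tokenizer and generate APIs; the prompt and token budget are arbitrary.

messages = [{"role": "user", "content": "Explain MXFP4 in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))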

Try measuring load times and VRAM with and without these options; the difference is often clear.

Risks and practical limitations

Nothing magical: MXFP4 requires triton>=3.4 plus the kernels and accelerate packages, and a GPU with compute capability >= 7.5 to get the best results. If your environment doesn’t meet those requirements, transformers falls back to bfloat16 by default, which uses more memory. Also, some kernels are architecture-specific (for example, Flash Attention 3 with attention-sink support on Hopper hardware). It’s always wise to benchmark on your setup. (huggingface.co)
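
If you want to check your environment before downloading a large checkpoint, a quick sanity check along these lines can help. This is a sketch using standard torch and importlib.metadata calls, not something prescribed by the post.

import importlib.metadata as md
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor} (MXFP4 kernels want >= 7.5)")

for pkg in ("triton", "kernels", "accelerate"):
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed (transformers would fall back to bfloat16)")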

How to start today (quick)

  1. Update transformers, torch and accelerate.
  2. Try loading a model with device_map="auto" and use_kernels=True to see logs confirming which kernels are used.
  3. If your GPU meets the requirements, enable MXFP4, compare VRAM usage before and after (see the sketch below), and run hf cache scan to check which kernels were downloaded.
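
One way to make that VRAM comparison concrete, reusing the model ID and options from the earlier example (torch’s memory counters report the peak on the current CUDA device):

import torch
from transformers import AutoModelForCausalLM

torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    dtype="auto",
    device_map="auto",
    use_kernels=True,  # rerun with different options (e.g. without kernels) to compare
)
print(f"Peak VRAM on this device: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")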

Sound like a lot to do? Start with a small example (e.g., GPT‑OSS 20B in Colab) and scale up as you see how cost and latency change.

Final reflection

This isn’t optimization for hobbyists only: these changes lower the barrier for large models to be useful in real environments, from notebook prototypes to production services. The takeaway? The community is shifting infrastructure toward more practical, reusable solutions: downloadable kernels, common quantization formats, and parallelism plans that used to need heavy engineering. If you want to experiment, you now have more direct tools to try and compare results in your own setup.

Summary: Hugging Face updated transformers to use kernels from the Hub, MXFP4 (4-bit), parallelism (tensor and experts), dynamic cache and continuous batching — all aimed at reducing memory and speeding up large models. The September 11, 2025 post shows how these improvements let you run massive models with lower cost and infrastructure impact. (huggingface.co)
