Mixture of Experts optimizes Transformers and LLMs | Keryc
Over the last few years, scaling dense models was the go-to recipe for improving LLMs: more data + more parameters = better performance. But that recipe runs into practical limits: expensive training, rising inference latency, and memory requirements that put large models out of reach for many teams.
What's the solution? Mixture of Experts (MoEs). They keep the Transformer backbone, but replace dense feed‑forward layers with a set of experts (trainable sub-networks). A router decides, token by token, which few experts process each input. The result: high total model capacity, while inference cost depends only on the parameters that are active per token. Sounds attractive, right?
Why MoEs matter for engineers and teams deploying LLMs
MoEs offer three clear advantages you should care about:
Compute efficiency: with the same FLOPs training budget, MoEs tend to outperform dense models, letting you iterate faster.
A natural axis for parallelism: different tokens activate different experts, so you can parallelize by expert (expert parallelism), which is handy in clusters.
Industrial adoption: recent models like Qwen 3.5, MiniMax M2, or GLM-5 use MoEs; DeepSeek R1 was a turning point.
A practical example: gpt-oss-20b has 21B total parameters, but only 4 experts active per token out of 32. That translates to ~3.6B active parameters per token. On an M3 Ultra (bandwidth ~800 GB/s) using bfloat16 (2 bytes per parameter), a quick estimate gives ~111 tokens/s; measured was ~115 tok/s. In other words: latency similar to a 3.6B model, quality closer to 21B. Not bad, huh?
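The back-of-the-envelope estimate above is easy to reproduce: memory-bandwidth-bound decoding reads every active parameter once per generated token, so tokens/s ≈ bandwidth / (active params × bytes per param). A minimal sketch:

```python
# Rough decode-throughput estimate for a memory-bandwidth-bound MoE.
# Assumption: each generated token reads every *active* parameter once.

def estimate_tokens_per_second(active_params: float,
                               bytes_per_param: int,
                               bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# gpt-oss-20b: ~3.6B active parameters per token, bfloat16 (2 bytes),
# M3 Ultra with ~800 GB/s memory bandwidth.
tps = estimate_tokens_per_second(3.6e9, 2, 800)
print(f"~{tps:.0f} tok/s")  # ~111 tok/s
```

Real throughput depends on kernel efficiency and KV-cache reads too, which is why the measured ~115 tok/s lands close to but not exactly on the estimate.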
Engineering: the main pain of MoEs and how Transformers solved it
The ecosystem was built for dense models: loading checkpoints, device placement, quantization and execution. MoEs break several assumptions:
In checkpoints, each expert is often serialized as separate tensors: model.layers.3.mlp.experts.0.gate_proj.weight ... experts.255.gate_proj.weight.
At runtime, modern kernels expect weights packed into a single contiguous tensor so they can use grouped GEMMs or fused kernels.
That mismatch between checkpoint layout and runtime layout forces weight transformations when loading the model. The fix was to refactor loading with an abstraction called WeightConverter.
What does WeightConverter do?
Think of loading not as a simple key-to-key copy, but as a conversion pipeline from the serialized checkpoint structure to the layout the runtime expects.
WeightConverter defines source → destination patterns plus operations. Primitive ops (chunk, concatenate, etc.) are composable. Two key operations for MoEs:
MergeModulelist: merges a list of tensors into one. Useful to stack experts.
SplitModulelist: splits a tensor into multiple tensors. Useful to go back to per-expert format.
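To illustrate the idea behind these two ops (plain torch, not the actual WeightConverter API), merging per-expert checkpoint tensors into one packed runtime tensor and splitting it back looks like this:

```python
import torch

# Hypothetical per-expert checkpoint tensors, as serialized on disk:
# experts.{i}.gate_proj.weight, each of shape [ffn_dim, hidden_dim].
num_experts, ffn_dim, hidden_dim = 4, 8, 6
checkpoint = {
    f"experts.{i}.gate_proj.weight": torch.randn(ffn_dim, hidden_dim)
    for i in range(num_experts)
}

# MergeModulelist-style op: stack per-expert tensors into one contiguous
# [num_experts, ffn_dim, hidden_dim] tensor, the layout grouped GEMMs expect.
packed = torch.stack(
    [checkpoint[f"experts.{i}.gate_proj.weight"] for i in range(num_experts)]
).contiguous()
print(packed.shape)  # torch.Size([4, 8, 6])

# SplitModulelist-style op: unbind back to the per-expert layout.
unpacked = packed.unbind(dim=0)
assert all(torch.equal(u, checkpoint[f"experts.{i}.gate_proj.weight"])
           for i, u in enumerate(unpacked))
```

The real pipeline does this as part of loading, so the packed tensor is built once instead of reshuffling weights on every forward pass.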
Loading is also better planned: it scans keys once, groups by converter, materializes tensors with a thread pool and runs conversions only when dependencies are ready. That reduces memory spikes and avoids repeated scans.
Practical results: faster loading and integrated quantization
Benchmarks comparing transformers branches v4 vs v5 show significant improvements in load times for large MoE models (example with Qwen/Qwen1.5-110B-Chat on 1× A100 80GB):
v4 (device_map="auto", threadpool): 66.24 s
v5 (device_map="auto", async): 20.71 s
v5 (TP, async): 10.1 s
It's not just using more threads: the combination of one‑pass routing, asynchronous materialization and conversion‑aware planning lets the loader pack experts and fuse projections while loading.
Also, the refactor allows integrating quantization into the same conversion pipeline. Quantizing "per expert" only makes sense if experts are in a predictable packed layout.
Efficient execution: expert backends
Packing weights is only half the job. At inference each token is routed to a subset of experts and you need to:
Dispatch tokens to the selected experts' weights.
Run projections efficiently.
Apply routing weights and regroup results in the original order.
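The three steps above correspond roughly to what an eager expert loop does. A minimal sketch (illustrative shapes and names, not the transformers implementation):

```python
import torch

torch.manual_seed(0)
num_tokens, hidden, num_experts, top_k = 5, 8, 4, 2

x = torch.randn(num_tokens, hidden)
router = torch.nn.Linear(hidden, num_experts)
# Packed expert weights: one [hidden, hidden] projection per expert.
expert_w = torch.randn(num_experts, hidden, hidden)

# 1) Route: pick top-k experts per token and normalize their scores.
scores = router(x).softmax(dim=-1)                  # [tokens, experts]
topk_scores, topk_idx = scores.topk(top_k, dim=-1)  # [tokens, k]
topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

# 2+3) Eager dispatch: loop over experts, project the tokens assigned
# to each one, and scatter the weighted results back in token order.
out = torch.zeros_like(x)
for e in range(num_experts):
    token_ids, k_slot = (topk_idx == e).nonzero(as_tuple=True)
    if token_ids.numel() == 0:
        continue  # this expert received no tokens
    projected = x[token_ids] @ expert_w[e]          # run the expert
    out[token_ids] += topk_scores[token_ids, k_slot, None] * projected

print(out.shape)  # torch.Size([5, 8])
```

The per-expert loop is slow but easy to verify, which is exactly why it serves as the correctness reference for the faster backends below.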
Transformers solved this with a pluggable experts-backend system that decouples expert computation from the model implementation; you select the backend at runtime with the @use_experts_implementation decorator.
Current backends:
eager: loops per expert; good for debugging and correctness reference.
batched_mm: replicates selected weights per token and uses torch.bmm; good for small batches on GPUs with available memory.
grouped_mm: sorts tokens by expert ID and uses torch._grouped_mm; shines in large batches or memory‑constrained setups.
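The core idea behind grouped_mm, sorting tokens by expert ID so each expert owns a contiguous slice, can be shown without the private kernel (torch._grouped_mm is deliberately not called here):

```python
import torch

# Flattened (token, expert) assignments after top-k routing.
flat_expert_idx = torch.tensor([2, 0, 1, 0, 2, 1, 0])

# Sort tokens by expert ID so each expert's tokens are contiguous;
# grouped-GEMM kernels consume exactly this layout plus the group sizes.
order = torch.argsort(flat_expert_idx, stable=True)
sorted_experts = flat_expert_idx[order]
counts = torch.bincount(sorted_experts, minlength=3)
print(sorted_experts.tolist())  # [0, 0, 0, 1, 1, 2, 2]
print(counts.tolist())          # [3, 2, 2] tokens per expert

# The inverse permutation restores the original token order afterwards.
inverse = torch.empty_like(order)
inverse[order] = torch.arange(order.numel())
assert torch.equal(flat_expert_idx, sorted_experts[inverse])
```

Because sorting replaces per-token weight replication, this layout avoids the memory overhead that makes batched_mm a poor fit for large batches.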
Expert Parallelism: how to scale beyond a single GPU
MoEs can reach hundreds of billions of parameters. The trick is that each token activates few experts, so you don't need all weights on every GPU. enable_expert_parallel=True activates the expert‑parallel plan and changes sharding.
Key components:
GroupedGemmParallel: shards expert weights along the experts dimension (dim=0), so each device loads only num_experts / num_devices experts.
RouterParallel: remaps global expert indices to local ones, masks experts not assigned to the rank and uses an all‑reduce to combine partial outputs.
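The global-to-local remapping can be sketched as follows (illustrative logic, not the transformers internals): with E experts across D ranks, rank r owns experts [r·E/D, (r+1)·E/D); indices outside that range are masked so the rank contributes zeros, and the all‑reduce then sums the per-rank partial outputs.

```python
import torch

num_experts, num_ranks = 8, 4
experts_per_rank = num_experts // num_ranks  # each rank loads 2 experts

def remap_for_rank(global_idx: torch.Tensor, rank: int):
    """Map global expert IDs to rank-local IDs; mask experts owned elsewhere."""
    lo = rank * experts_per_rank
    local = global_idx - lo
    mine = (local >= 0) & (local < experts_per_rank)
    # Masked entries get local ID 0 but are weighted by 0 downstream,
    # so they contribute nothing to this rank's partial output.
    return torch.where(mine, local, torch.zeros_like(local)), mine

global_idx = torch.tensor([0, 3, 5, 6])          # experts chosen by the router
local_idx, mask = remap_for_rank(global_idx, rank=1)  # rank 1 owns {2, 3}
print(local_idx.tolist(), mask.tolist())  # [0, 1, 0, 0] [False, True, False, False]
```

Each rank runs this remap independently; the all‑reduce at the end is what stitches the per-rank partial results back into the full output.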
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b", dtype="auto", distributed_config=distributed_config)
Recommended launch: torchrun --nproc-per-node N script.py, where N evenly divides the total number of experts.
Training: still challenging, but with progress
Training MoEs is more complex than inference: distributed communication between experts, routing instabilities, and massive parameter counts. A collaboration with Unsloth delivered important optimizations:
~12× faster MoE training
35% VRAM reduction
~6× longer context
12–30× overall speedup versus v4
These gains combine the experts backend, standardization on torch._grouped_mm and custom Triton kernels for grouped‑GEMM and LoRA.
What does this mean for you as a developer or product lead?
If you build models or inference pipelines at scale, MoEs let you keep quality with lower latency and hardware investment.
If you work on infra, pay attention to checkpoint loading and weight layout: the new WeightConverter pipeline and asynchronous loading change the game for bringing large models into memory.
If you research, there’s space for new routing strategies, stability metrics, and optimized kernels that reduce communication and improve efficiency.
MoE adoption is no longer a lab curiosity: libraries and tooling are evolving to make them practical in production.