EMO: Pretrained MoE achieves emergent modularity in AI
EMO is a new mixture-of-experts (MoE) model pretrained end-to-end that lets modularity emerge directly from the data, without imposing human-made divisions. What’s the result? You can use only a small subset of experts — 12.5% — for a task and keep almost the same performance as the full model.
What EMO proposes and why it matters
Most large language models are treated as monolithic blocks: you train one big model and everyone uses the whole thing. But in practice many applications only need a few specific capabilities: code generation, math reasoning, medical knowledge, and so on. That makes deploying and adapting giant models expensive and inefficient.
MoEs seem like the natural solution: each layer contains many experts and only a few activate per token. In theory you could load only the experts relevant for a task. In practice, standard MoEs don’t let you do that: during generation the same text activates different experts token-by-token, and specialization often lands on superficial patterns like prepositions instead of semantic domains.
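To make that concrete, here is a minimal sketch of standard token-level top-k routing (my own illustration, not EMO's code; the router, dimensions, and tensor names are placeholders):

```python
# Minimal sketch of standard token-level top-k MoE routing, to illustrate why
# adjacent tokens in the same text can end up on different experts.
import torch

num_experts, top_k, d_model = 128, 8, 512
router = torch.nn.Linear(d_model, num_experts)   # placeholder router weights

tokens = torch.randn(10, d_model)                # 10 tokens from one document
probs = torch.softmax(router(tokens), dim=-1)    # routing distribution per token
topk = probs.topk(top_k, dim=-1).indices         # each token independently picks 8 experts

# Different rows usually select different expert sets, so serving even one
# document still requires keeping (almost) all experts loaded.
print(topk)
```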
EMO changes that. Instead of imposing domains by hand, it makes modular structure arise during pretraining using document boundaries as a weak signal. Tokens from the same document are forced to pick their experts from a shared pool that the router chooses for that document. That way experts tend to cluster by real, reusable capabilities.
Architecture and key figures
Size: 14B total parameters and 128 experts.
Active per token: 8 experts activate per token, for roughly 1B active parameters.
Pretraining data: 1 trillion tokens.
Useful subsets: you can use only 12.5% of the experts (16 experts) and lose only ~3% average performance; with 25% (32 experts) the drop is ~1%.
Those numbers are what let EMO be both modular and strong as a general model when you use all experts.
The critical component is the router, the small network that decides which experts to activate for each token. The core idea is that tokens from the same document tend to belong to the same “capability” or domain. During training, all tokens of a document must choose experts within a shared pool.
Concretely:
For each document, the average of the router’s preferences over experts is computed.
The most-used experts are selected as the document’s pool.
All tokens of the document can only route within that pool.
This lets recurring groups of experts emerge naturally from the corpus structure, without manual labels.
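Based on that description, here is a minimal sketch of the document-pool constraint; the names (`router`, `pool_size`, `tokens`) and the exact pool-size range are illustrative assumptions, not EMO's released code:

```python
# Sketch of document-level routing: average the router distribution over the
# document, keep the top-m experts as the shared pool, then restrict every
# token's top-k choice to that pool.
import torch

num_experts, top_k, d_model = 128, 8, 512
router = torch.nn.Linear(d_model, num_experts)

tokens = torch.randn(200, d_model)               # all tokens of one document
probs = torch.softmax(router(tokens), dim=-1)

# The pool size is sampled per document during training (see the next section);
# the 16-64 range here is only for illustration.
pool_size = int(torch.randint(16, 65, (1,)))
doc_pref = probs.mean(dim=0)                     # document-level router preference
pool = doc_pref.topk(pool_size).indices          # shared expert pool for this document

# Mask every expert outside the pool, then do the usual top-k per token.
mask = torch.full((num_experts,), float("-inf"))
mask[pool] = 0.0
constrained = torch.softmax(router(tokens) + mask, dim=-1)
topk = constrained.topk(top_k, dim=-1).indices   # every token routes inside the pool
```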
Load balancing and pool size
There are two important engineering choices:
Load balancing: MoEs usually add an auxiliary loss so that tokens don't all collapse onto a few experts. Applied locally (per micro-batch), that balancing clashes with the goal of each document using few experts. EMO instead applies load balancing at a global scale, across many documents: each document keeps a small pool internally, while globally the system is still pushed to cover all experts (a minimal sketch of the contrast follows this list).
Document pool size: controls how restrictive the modularity is. EMO doesn’t fix a single size: during training the pool size is sampled randomly. That prevents overfitting to one granularity and gives flexibility at inference time.
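The sketch below illustrates the scope difference under my own assumptions, using a Switch-Transformer-style f·p balancing term; it is a toy contrast, not the exact loss EMO trains with:

```python
# Toy contrast between per-micro-batch (local) and cross-document (global)
# load-balancing statistics.
import torch

num_experts, top_k, d_model = 128, 8, 512
router = torch.nn.Linear(d_model, num_experts)

def balance_loss(probs, assignments, num_experts):
    # f_i: fraction of routed slots that went to expert i
    counts = torch.bincount(assignments.flatten(), minlength=num_experts).float()
    f = counts / counts.sum()
    # p_i: mean router probability put on expert i
    p = probs.mean(dim=0)
    return num_experts * (f * p).sum()

docs = [torch.randn(200, d_model) for _ in range(8)]          # dummy documents
probs = [torch.softmax(router(d), dim=-1) for d in docs]
assign = [p.topk(top_k, dim=-1).indices for p in probs]

# In EMO, each document routes inside a small pool, so its local statistics
# look unbalanced by design; penalizing them per document would fight that.
local = torch.stack([balance_loss(p, a, num_experts) for p, a in zip(probs, assign)]).mean()

# The global variant aggregates statistics across many documents first, so the
# corpus as a whole is pushed to cover all 128 experts.
global_ = balance_loss(torch.cat(probs), torch.cat(assign), num_experts)
```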
Results and comparisons
On general benchmarks EMO matches a standard MoE with the same architecture when all experts are used. The real advantage appears when selecting expert subsets:
With 25% of experts active the drop is around 1%.
With 12.5% the drop is close to 3%.
The matched standard MoE degrades sharply when you reduce experts, sometimes down to near-random performance in the smallest configurations.
Also, picking the right experts for a task turns out to be surprisingly cheap: a single few-shot example can identify a module that performs almost as well as using a large validation set. EMO also works well with existing expert pruning methods like Easy-EP.
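As a rough illustration of how cheap that selection can be, one plausible recipe (my assumption, not the paper's exact procedure or Easy-EP) is to score experts by the router mass they receive on the few-shot example and keep the top fraction:

```python
# Sketch: pick a task module from a single few-shot example by ranking experts
# on the router probability mass they receive.
import torch

num_experts, d_model = 128, 512
router = torch.nn.Linear(d_model, num_experts)

few_shot_tokens = torch.randn(64, d_model)           # hidden states of one few-shot example
scores = torch.softmax(router(few_shot_tokens), dim=-1).mean(dim=0)

keep = max(1, int(0.125 * num_experts))              # 12.5% -> 16 experts
module = scores.topk(keep).indices                   # experts to load for this task
print(sorted(module.tolist()))
```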
What the experts actually do
To investigate, the authors clustered router activations from tokens in 12K documents and saw a clear difference:
EMO produces semantic clusters: Health, Medical & Wellness, News Reporting, US Politics & Elections, Film & Music, etc.
A standard MoE produces superficial clusters: prepositions, proper names, copular verbs, definite articles.
In simple terms: EMO groups by real capability (topic or domain), not by lexical traits. That explains why a small group of experts can sustain a whole task.
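For readers who want to reproduce that kind of analysis, a minimal sketch could look like the following; KMeans and the random placeholder data are my assumptions, and the semantic labels come from manually inspecting each cluster, not from the algorithm:

```python
# Sketch: represent each document by its mean router distribution and cluster
# those vectors to see which experts co-activate.
import numpy as np
from sklearn.cluster import KMeans

num_docs, num_experts = 12000, 128
doc_router_means = np.random.rand(num_docs, num_experts)    # placeholder for real router stats
doc_router_means /= doc_router_means.sum(axis=1, keepdims=True)

clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(doc_router_means)
# Reading the documents grouped under each cluster id is what reveals whether
# the clusters are semantic (EMO) or lexical (standard MoE).
```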
Code and artifacts
The authors release the trained EMO model, a matched MoE baseline and the training code. That makes it easy to reproduce experiments and tackle open questions.
Limits and open questions
EMO is an important step, but challenges remain:
How to select and compose expert subsets automatically and robustly.
How to update modules without breaking the full-model performance.
How to use modularity to improve interpretability and control.
These are great topics for community work now that the model and code are available.
In the end, EMO shows that if you design the pretraining signal and the scope of load balancing well, modularity can emerge on its own. Can you imagine deploying giant models where you only load what you need and still get cutting-edge results? That doesn't feel so far away anymore.