EMO: Pretrained MoE achieves emergent modularity in AI
EMO is a new mixture-of-experts (MoE) model pretrained end-to-end that lets modularity emerge directly from the data, without imposing human-made divisions. What’s the result? You can use only a small subset of experts — 12.5% — for a task and keep almost the same performance as the full model.
What EMO proposes and why it matters
Most large language models are treated as monolithic blocks: you train one big model and everyone uses the whole thing. But in practice many applications only need a few specific capabilities: code generation, math reasoning, medical knowledge, and so on. That makes deploying and adapting giant models expensive and inefficient.
MoEs seem like the natural solution: each layer contains many experts and only a few activate per token. In theory you could load only the experts relevant for a task. In practice, standard MoEs don’t let you do that: during generation the same text activates different experts token-by-token, and specialization often lands on superficial patterns like prepositions instead of semantic domains.
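To make that concrete, here is a minimal sketch of standard token-level top-k routing (my own illustration, not EMO's code; the router, dimensions, and tensor names are placeholders):

```python
# Minimal sketch of standard token-level top-k MoE routing, to illustrate why
# adjacent tokens in the same text can end up on different experts.
import torch

num_experts, top_k, d_model = 128, 8, 512
router = torch.nn.Linear(d_model, num_experts)   # placeholder router weights

tokens = torch.randn(10, d_model)                # 10 tokens from one document
probs = torch.softmax(router(tokens), dim=-1)    # routing distribution per token
topk = probs.topk(top_k, dim=-1).indices         # each token independently picks 8 experts

# Different rows usually select different expert sets, so serving even one
# document still requires keeping (almost) all experts loaded.
print(topk)
```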
EMO changes that. Instead of imposing domains by hand, it makes modular structure arise during pretraining using document boundaries as a weak signal. Tokens from the same document are forced to pick their experts from a shared pool that the router chooses for that document. That way experts tend to cluster by real, reusable capabilities.
Architecture and key figures
Size: 14B total parameters and 128 experts.
Active per token: 8 experts activate per token, for roughly 1B active parameters.
Pretraining data: 1 trillion tokens.
Useful subsets: you can use only 12.5% of the experts (16 experts) and lose only ~3% average performance; with 25% (32 experts) the drop is ~1%.
Those numbers are what let EMO be both modular and strong as a general model when you use all experts.
The critical component is the router, the small network that decides which experts to activate for each token. The core idea is that tokens from the same document tend to belong to the same “capability” or domain. During training, all tokens of a document must choose experts within a shared pool.
Concretely:
For each document, the average of the router’s preferences over experts is computed.
The most-used experts are selected as the document’s pool.
All tokens of the document can only route within that pool.
This lets recurring groups of experts emerge naturally from the corpus structure, without manual labels.
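Based on that description, here is a minimal sketch of the document-pool constraint; the names (`router`, `pool_size`, `tokens`) and the exact pool-size range are illustrative assumptions, not EMO's released code:

```python
# Sketch of document-level routing: average the router distribution over the
# document, keep the top-m experts as the shared pool, then restrict every
# token's top-k choice to that pool.
import torch

num_experts, top_k, d_model = 128, 8, 512
router = torch.nn.Linear(d_model, num_experts)

tokens = torch.randn(200, d_model)               # all tokens of one document
probs = torch.softmax(router(tokens), dim=-1)

# The pool size is sampled per document during training (see the next section);
# the 16-64 range here is only for illustration.
pool_size = int(torch.randint(16, 65, (1,)))
doc_pref = probs.mean(dim=0)                     # document-level router preference
pool = doc_pref.topk(pool_size).indices          # shared expert pool for this document

# Mask every expert outside the pool, then do the usual top-k per token.
mask = torch.full((num_experts,), float("-inf"))
mask[pool] = 0.0
constrained = torch.softmax(router(tokens) + mask, dim=-1)
topk = constrained.topk(top_k, dim=-1).indices   # every token routes inside the pool
```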
Load balancing and pool size
There are two important engineering choices:
Load balancing: MoEs usually add an auxiliary loss so that tokens don't all collapse onto a few experts. Applied locally (per micro-batch), that balancing clashes with the goal of each document using few experts. EMO instead applies load balancing at a global scale, across many documents: each document keeps a small pool internally, while globally the system is still pushed to cover all experts (a minimal sketch of the contrast follows this list).
Document pool size: controls how restrictive the modularity is. EMO doesn’t fix a single size: during training the pool size is sampled randomly. That prevents overfitting to one granularity and gives flexibility at inference time.
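The sketch below illustrates the scope difference under my own assumptions, using a Switch-Transformer-style f·p balancing term; it is a toy contrast, not the exact loss EMO trains with:

```python
# Toy contrast between per-micro-batch (local) and cross-document (global)
# load-balancing statistics.
import torch

num_experts, top_k, d_model = 128, 8, 512
router = torch.nn.Linear(d_model, num_experts)

def balance_loss(probs, assignments, num_experts):
    # f_i: fraction of routed slots that went to expert i
    counts = torch.bincount(assignments.flatten(), minlength=num_experts).float()
    f = counts / counts.sum()
    # p_i: mean router probability put on expert i
    p = probs.mean(dim=0)
    return num_experts * (f * p).sum()

docs = [torch.randn(200, d_model) for _ in range(8)]          # dummy documents
probs = [torch.softmax(router(d), dim=-1) for d in docs]
assign = [p.topk(top_k, dim=-1).indices for p in probs]

# In EMO, each document routes inside a small pool, so its local statistics
# look unbalanced by design; penalizing them per document would fight that.
local = torch.stack([balance_loss(p, a, num_experts) for p, a in zip(probs, assign)]).mean()

# The global variant aggregates statistics across many documents first, so the
# corpus as a whole is pushed to cover all 128 experts.
global_ = balance_loss(torch.cat(probs), torch.cat(assign), num_experts)
```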
Results and comparisons
On general benchmarks EMO matches a standard MoE with the same architecture when all experts are used. The real advantage appears when selecting expert subsets:
With 25% of experts active the drop is around 1%.
With 12.5% the drop is close to 3%.
The matched standard MoE degrades sharply when you reduce experts, sometimes down to near-random performance in the smallest configurations.
Also, picking the right experts for a task turns out to be surprisingly cheap: a single few-shot example can identify a module that performs almost as well as using a large validation set. EMO also works well with existing expert pruning methods like Easy-EP.
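As a rough illustration of how cheap that selection can be, one plausible recipe (my assumption, not the paper's exact procedure or Easy-EP) is to score experts by the router mass they receive on the few-shot example and keep the top fraction:

```python
# Sketch: pick a task module from a single few-shot example by ranking experts
# on the router probability mass they receive.
import torch

num_experts, d_model = 128, 512
router = torch.nn.Linear(d_model, num_experts)

few_shot_tokens = torch.randn(64, d_model)           # hidden states of one few-shot example
scores = torch.softmax(router(few_shot_tokens), dim=-1).mean(dim=0)

keep = max(1, int(0.125 * num_experts))              # 12.5% -> 16 experts
module = scores.topk(keep).indices                   # experts to load for this task
print(sorted(module.tolist()))
```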
What the experts actually do
To investigate, the authors clustered router activations from tokens in 12K documents and saw a clear difference:
EMO produces semantic clusters: Health, Medical & Wellness, News Reporting, US Politics & Elections, Film & Music, etc.
A standard MoE produces superficial clusters: prepositions, proper names, copular verbs, definite articles.
In simple terms: EMO groups by real capability (topic or domain), not by lexical traits. That explains why a small group of experts can sustain a whole task.
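For readers who want to reproduce that kind of analysis, a minimal sketch could look like the following; KMeans and the random placeholder data are my assumptions, and the semantic labels come from manually inspecting each cluster, not from the algorithm:

```python
# Sketch: represent each document by its mean router distribution and cluster
# those vectors to see which experts co-activate.
import numpy as np
from sklearn.cluster import KMeans

num_docs, num_experts = 12000, 128
doc_router_means = np.random.rand(num_docs, num_experts)    # placeholder for real router stats
doc_router_means /= doc_router_means.sum(axis=1, keepdims=True)

clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(doc_router_means)
# Reading the documents grouped under each cluster id is what reveals whether
# the clusters are semantic (EMO) or lexical (standard MoE).
```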
Code and artifacts
The authors release the trained EMO model, a matched MoE baseline and the training code. That makes it easy to reproduce experiments and tackle open questions.
Limits and open questions
EMO is an important step, but challenges remain:
How to select and compose expert subsets automatically and robustly.
How to update modules without breaking the full-model performance.
How to use modularity to improve interpretability and control.
These are great topics for community work now that the model and code are available.
In the end, EMO shows that if you design the pretraining signal and the scope of load balancing well, modularity can emerge on its own. Can you imagine deploying giant models where you only load what you need and still get cutting-edge results? That doesn't feel so far away anymore.