EMO arrives as an experiment that changes how we think about sparse models: it's not just a big MoE, but a MoE trained so that modularity emerges from the data. What does that mean for you? That a single model can behave like many specialized modules and, at the same time, let you use only a fraction of its parameters for specific tasks.
What is EMO and why does it matter?
EMO is a mixture-of-experts (MoE) pretrained end-to-end with modularity as an explicit objective. The reported version has 128 total experts, activates 8 per token, and corresponds to 14B total parameters with 1B active parameters; it was trained on 1 trillion tokens. Its big bet: let coherent groups of experts emerge without human labels, using only signals from document structure.
Why does this matter? Because monolithic models consume large amounts of memory and compute even when the task needs only a fraction of their capacity (for example, generating code or answering medical questions). If you can identify and load only the relevant experts, you cut deployment costs and make adaptation easier.
How it works (technical level)
The core idea is to change the scale of routing. In a standard MoE, each token chooses its top-k experts independently, and that tends to scatter routes across all experts. EMO introduces a constraint: tokens from the same document must choose from the same shared pool of experts. That simple rule pushes the router to group experts by semantic domain.
Routing by document
During training, EMO averages the router preferences over the tokens of a document and selects the most used experts to form the document's pool. Then each token within that document can only be routed to that pool. This creates recurring groups of experts that correspond to high-level capabilities, not just lexical patterns.
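To make this concrete, here is a minimal sketch of document-level routing in PyTorch. The function name, tensor shapes, and the exact way preferences are aggregated are assumptions for illustration, not EMO's actual implementation.

```python
import torch

def route_with_document_pool(router_logits, pool_size, top_k=8):
    # router_logits: (num_tokens, num_experts) logits for ONE document's tokens.
    # 1) Average the router's preferences over the document's tokens.
    doc_scores = router_logits.softmax(dim=-1).mean(dim=0)      # (num_experts,)
    # 2) The document's pool = the pool_size most-preferred experts.
    pool = doc_scores.topk(pool_size).indices                   # (pool_size,)
    # 3) Mask out experts outside the pool, then do ordinary per-token
    #    top-k routing restricted to that pool.
    masked = torch.full_like(router_logits, float("-inf"))
    masked[:, pool] = router_logits[:, pool]
    weights, experts = masked.softmax(dim=-1).topk(top_k, dim=-1)
    return experts, weights, pool

# Toy usage: one document with 512 tokens, 128 experts, a pool of 16.
logits = torch.randn(512, 128)
experts, weights, pool = route_with_document_pool(logits, pool_size=16)
```

The key point the sketch captures is that every token in the document draws from the same pool, so the same small group of experts keeps getting reinforced for documents of the same kind.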
Global load balancing and pool size
A technical challenge was load balancing. If you apply the balancing objective locally (per micro-batch), that forces the model to spread tokens within the same document across many experts, which clashes with modularity. EMO solves this by applying load balancing at a global scale across many documents. Result: different documents use different pools but, collectively, all experts get used.
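As a rough illustration, a standard auxiliary balancing loss can simply be computed over tokens pooled from many documents rather than per micro-batch. The sketch below assumes the classic top-1 formulation; names and details are illustrative, not EMO's code.

```python
import torch

def global_load_balance_loss(all_probs, all_assignments, num_experts):
    # all_probs:       (num_tokens, num_experts) router probabilities, pooled
    #                  across MANY documents (the "global" scale).
    # all_assignments: (num_tokens,) top-1 expert index chosen per token.
    counts = torch.bincount(all_assignments, minlength=num_experts).float()
    frac_tokens = counts / all_assignments.numel()   # token share per expert
    mean_prob = all_probs.mean(dim=0)                # probability mass per expert
    # Penalize experts that hog both tokens and probability mass. Computed
    # globally, individual documents can still concentrate on small pools.
    return num_experts * (frac_tokens * mean_prob).sum()

# Toy usage: 4096 tokens drawn from many documents, 128 experts.
probs = torch.rand(4096, 128).softmax(dim=-1)
assign = probs.argmax(dim=-1)
loss = global_load_balance_loss(probs, assign, num_experts=128)
```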
Pool size gets a similar treatment: instead of fixing a single size, EMO randomly samples it during training. That prevents overfitting to one budget and lets the model support different subset sizes at inference.
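A minimal sketch of that idea, with an assumed set of budgets (the actual sizes EMO samples from aren't specified here):

```python
import random

POOL_SIZES = [8, 16, 32, 64, 128]   # assumed budgets, for illustration only

def sample_pool_size() -> int:
    # Draw a fresh pool size per document so the model stays usable
    # under several expert budgets at inference time.
    return random.choice(POOL_SIZES)
```

In the routing sketch above, this would simply replace the fixed argument, e.g. route_with_document_pool(logits, pool_size=sample_pool_size()).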
Key results and metrics
Selectivity: EMO can use only 12.5% of its experts (16 of 128) for a task and stay close to full-model performance. With 25% of experts (32), the average drop is only about 1 percentage point; with 12.5%, it is about 3 points.
Robustness vs standard MoE: a standard MoE with the same architecture degrades sharply when restricted to expert subsets, sometimes falling to near-random performance at the smallest subset sizes.
Cheap selection: choosing the right subset doesn't require a large validation set. A single few-shot example can be enough to identify the right module; EMO also works well with existing expert-pruning methods such as Easy-EP.
Emergent interpretability: when router activations over the first 100 tokens of 12k documents are clustered, EMO yields semantic clusters (for example Health, US Politics, Film & Music), while a standard MoE tends to group documents by superficial traits such as prepositions or proper names (a sketch of this clustering step follows these results).
Memory-performance trade-off: in tests under limited memory budgets, EMO expert subsets advance the Pareto frontier, outperforming both standard MoEs and models trained from scratch at the same budget.
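Here is a rough sketch of that clustering analysis. The choice of k-means and the number of clusters are assumptions for illustration; the write-up only states that router activations were clustered.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_documents(doc_router_usage, n_clusters=20):
    # doc_router_usage: (num_docs, num_experts), e.g. each row is the average
    # router probability over a document's first 100 tokens.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(doc_router_usage)
    return labels   # documents sharing a label tend to share an expert pool

# Toy usage: 12k documents routed over 128 experts.
usage = np.random.rand(12_000, 128)
labels = cluster_documents(usage)
```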
Practical usage example
Imagine you run a medical QA service and don't want to load 14B parameters. With EMO you can: 1) use a small validation set of examples from your domain; 2) rank experts by usage on that set; 3) deploy only the 16 or 32 most relevant experts. You keep latency low and cut memory.
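A minimal sketch of steps 2 and 3 could look like this; the function name and shapes are assumptions, not an official EMO API.

```python
import torch

def select_top_experts(val_router_probs, budget=16):
    # val_router_probs: (num_val_tokens, num_experts) router probabilities
    # collected while running your domain validation examples through the model.
    usage = val_router_probs.mean(dim=0)          # average usage per expert
    return usage.topk(budget).indices.tolist()    # expert ids worth deploying

# Toy usage: pick the 16 most-used experts out of 128 for a medical QA set.
probs = torch.rand(2_000, 128).softmax(dim=-1)
experts_to_deploy = select_top_experts(probs, budget=16)
```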
Another advantage: composition. If your app needs several capabilities, you can combine small pools of specialized experts to build a modular solution without retraining the whole model.
Limitations and open questions
EMO is an important step, not a final solution. Open questions remain: how to optimally select and compose expert subsets, how to update or fine-tune individual modules without affecting the rest of the model, and to what extent modularity actually improves interpretability and control.
Also, while EMO reduces the need for domain labels, it depends on well-defined document boundaries in pretraining data; in corpora with poorly segmented documents the signal would be weaker.
What the team releases
The group releases the trained EMO model, a baseline standard MoE trained on the same data, the training code, and an interactive visualization of the router clusters. It's a practical resource for anyone who wants to replicate, explore emergent modularity, or experiment with modular deployments.
EMO shows that, with careful training design, modularity can emerge as a useful, usable property: not just an academic label, but a practical path to make large models more adaptable and cost-effective.