Olmix arrives as a practical answer to a problem you probably know well: mixing texts, code, PDFs and math in the right recipe to train a language model is more art than science. How much of each source do you use? How do you know if you’re wasting data or compute? Olmix proposes a repeatable, efficient workflow to decide and update that mix as you develop an LM.
What problem Olmix solves
When you build a language model, the dataset changes all the time: you add new corpora, filter out junk, repartition domains. Recomputing the mix from scratch after every change is expensive. And the literature doesn't give consistent guidance on how to set the mix: proxy model sizes, number of experiments, choice of regressor, and other decisions are left to intuition.
Olmix tackles two challenges: it provides empirically backed defaults so you’re not guessing configurations, and it offers mix-reuse techniques to update the mix efficiently as your corpus evolves.
OlmixBase: a starting point with experimental backing
OlmixBase bundles the findings of a large study aimed at practical questions: how small can your proxy models be? How many runs do you need? Which regressor should you use?
In the paper they trained 1B-parameter target models on the DCLM corpus, split into 24 thematic domains, and evaluated on 52 downstream tasks (math, code, commonsense QA). The main metric was bits-per-byte (BPB), which measures negative log-likelihood normalized by UTF-8 bytes.
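To make the metric concrete, here is a minimal sketch of how BPB can be computed from per-token log-probabilities. The helper name and the assumption that log-probabilities arrive in nats are mine, not from the paper.

```python
import math

def bits_per_byte(token_logprobs_nats, text):
    """Bits-per-byte: total negative log-likelihood, converted from
    nats to bits, divided by the UTF-8 byte length of the text."""
    nll_nats = -sum(token_logprobs_nats)   # total NLL in nats
    nll_bits = nll_nats / math.log(2)      # nats -> bits
    n_bytes = len(text.encode("utf-8"))    # normalize by raw bytes
    return nll_bits / n_bytes

# Toy usage: four tokens with made-up log-probs for a short string.
print(bits_per_byte([-2.1, -0.7, -3.3, -1.5], "hello world"))
```

Lower BPB means the model compresses the evaluation text better, and because it normalizes by bytes rather than tokens it is comparable across tokenizers.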
These are the key takeaways in OlmixBase:
- Proxy models need to be large enough to trust. Proxies above ~15M parameters keep a high ranking correlation with 1B target models (ρ > 0.89). Very small proxies (1M) drop to ρ = 0.73, too noisy for reliable mixing decisions. If you're using tiny proxies to save compute, you may be throwing away useful signal.
- Costs scale linearly with the number of domains. The size of the "swarm" (the set of proxy runs) should grow in O(m) for m domains. That gives a practical rule for budgeting compute as your domain set grows.
- Log-linear regression is a robust default. They tried several regressors and found that, overall, log-linear regression offers the best global fit and stays competitive on downstream validation. The best choice can depend on swarm size, but log-linear is a solid starting point.
- Constrain repetition of rare data. A common failure mode is the optimizer assigning too many iterations to a small domain (for example, code), forcing harmful repetition. OlmixBase includes feasibility constraints so the mix never allocates more training than the data actually available (a minimal sketch of how the last two points could fit together follows this list).
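To ground those last two points, here is a minimal sketch of what a log-linear mixture regressor with repetition constraints could look like. The functional form (log BPB linear in the mixture weights), the Dirichlet candidate search, and the `max_epochs` cap are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_log_linear(mixtures, bpb):
    """Fit log(BPB) ~ b + w.p by least squares over the proxy swarm.
    mixtures: (n_runs, n_domains) rows on the simplex; bpb: (n_runs,)."""
    X = np.hstack([np.ones((len(mixtures), 1)), mixtures])
    coef, *_ = np.linalg.lstsq(X, np.log(bpb), rcond=None)
    return coef  # [b, w_1, ..., w_m]

def predict_bpb(coef, p):
    return np.exp(coef[0] + p @ coef[1:])

def best_feasible_mix(coef, tokens_available, train_tokens,
                      max_epochs=4.0, n_candidates=100_000):
    """Search Dirichlet-sampled candidate mixes, drop any that would
    repeat a domain more than max_epochs times, pick the lowest BPB."""
    m = len(tokens_available)
    cand = rng.dirichlet(np.ones(m), size=n_candidates)
    epochs = cand * train_tokens / tokens_available   # implied repetitions
    feasible = cand[(epochs <= max_epochs).all(axis=1)]
    return feasible[np.argmin(predict_bpb(coef, feasible))]

# Toy example with 3 domains; the swarm size stays proportional to the
# number of domains (O(m) proxy runs), per the scaling rule above.
m, n_runs = 3, 8 * 3
mixes = rng.dirichlet(np.ones(m), size=n_runs)
bpb = np.exp(0.1 + mixes @ np.array([0.3, -0.2, 0.05])) + rng.normal(0, 0.01, n_runs)
coef = fit_log_linear(mixes, bpb)
print(best_feasible_mix(coef, tokens_available=np.array([50e9, 5e9, 2e9]),
                        train_tokens=100e9))
```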
All of this comes "preconfigured" in OlmixBase to give you a reasonable, reproducible starting point in early development.
Mix reuse: how to update without recomputing everything
The second big piece of Olmix is mix reuse. The practical insight is simple: when you change your corpus you usually alter a few domains, not all of them. So you don’t need to recompute a global mix if you can reuse the relationships between domains that didn’t change.
The central technique is to group unchanged domains into a "virtual domain" and solve a much smaller mixing problem that includes that virtual domain plus the changed domains. Then you expand the solution back to the full space. This drastically reduces the number of proxy runs because cost scales linearly with the number of domains you optimize.
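Here is a minimal sketch of that collapse-and-expand step, assuming the old mix is a dict of domain → weight and that `solve_mix` stands in for whatever small proxy-swarm optimization you run over the reduced problem (both names, and the `"__virtual__"` label, are mine):

```python
def reuse_mix(old_mix, changed, solve_mix):
    """Collapse unchanged domains into one virtual domain, solve the
    smaller mixing problem, then expand the result back out."""
    unchanged = {d: w for d, w in old_mix.items() if d not in changed}
    virtual_weight = sum(unchanged.values())
    # Relative proportions inside the virtual domain are frozen.
    internal = {d: w / virtual_weight for d, w in unchanged.items()}

    # Small problem: one virtual domain plus only the changed domains.
    small = solve_mix(["__virtual__"] + sorted(changed))

    # Expand: split the virtual domain's new weight by the frozen ratios.
    new_mix = {d: small["__virtual__"] * r for d, r in internal.items()}
    new_mix.update({d: small[d] for d in changed})
    return new_mix

def fake_solver(domains):
    # Stand-in for a proxy swarm + regression over the reduced problem.
    return {"__virtual__": 0.7, "math": 0.3}

old = {"web": 0.6, "wiki": 0.2, "code": 0.2}
print(reuse_mix(old, changed={"math"}, solve_mix=fake_solver))
# ≈ {'web': 0.42, 'wiki': 0.14, 'code': 0.14, 'math': 0.3}, up to float rounding
```

With one new domain added, only two "domains" need proxy runs here instead of four, and the saving grows with the number of untouched domains.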
Olmix presents two strategies:
- Full Mixture Reuse: keep the proportions of all unchanged domains fixed and recompute only the changed ones. It's extremely efficient and in many scenarios captures most of the benefit of full recomputation.
- Partial Mixture Reuse: when there are couplings between domains (for example, the web section on "software development" competes with a new code corpus for programming tasks), you also recompute a selected subset of unchanged domains that are coupled with the new ones. This reduces coupling effects and closes the gap to full recomputation with only a few extra runs (a small sketch of this selection step appears after this list).
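As a rough illustration of the selection step, the sketch below picks which unchanged domains to pull into the recomputation based on a coupling score with the changed ones. The coupling scores and the 0.5 threshold are purely illustrative assumptions; the paper's actual selection criterion may differ.

```python
def partial_reuse_domains(changed, coupling, threshold=0.5):
    """Domains to recompute: the changed ones plus any unchanged domain
    strongly coupled to a changed one (e.g. competing for the same tasks)."""
    recompute = set(changed)
    for (a, b), score in coupling.items():
        if score >= threshold:
            if a in changed:
                recompute.add(b)
            elif b in changed:
                recompute.add(a)
    return recompute

# Example: a new "code" corpus overlaps with the web "software_dev" slice.
coupling = {("code", "software_dev"): 0.8, ("code", "wiki"): 0.1}
print(sorted(partial_reuse_domains({"code"}, coupling)))  # ['code', 'software_dev']
```

Everything returned here is then treated as "changed" in the collapse-and-expand routine above; the rest of the corpus stays inside the virtual domain.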
Practical results (experiment summary)
For a realistic sequence of 5 updates ending at 64 domains, training 1B-parameter models on 100B tokens:
- Full Mixture Reuse achieves 95% of the improvement you'd get from full recomputation, using 74% fewer proxy runs (216 vs 832).
- Partial Mixture Reuse reaches 98% with 67% fewer runs.
- The best mix found via reuse techniques beats the natural distribution (the no-mixing baseline) by 12.2% and is 3.05x more data-efficient.
The final mix tends to overweight high-value domains like arXiv, FineMath and code, which makes sense if your downstream tasks include academic and complex math problems.
Why this matters for your project (practical tips)
If you train LMs with heterogeneous data, Olmix helps you stop guessing and spend less compute when your corpus changes.
Quick recommendations:
- Start with OlmixBase to get an initial mix: use proxies of ≈15M parameters, apply log-linear regression, and set repetition constraints based on data availability (a configuration sketch capturing these defaults follows this list).
- When the corpus changes, use Full Mixture Reuse if the changes touch few domains without big semantic overlaps.
- If you detect couplings (new data competing with existing parts of the corpus for the same tasks), apply Partial Mixture Reuse and include those domains in the recomputation.
- Measure BPB and downstream metrics relevant to your tasks; don't assume the BPB-optimal mix will be identical for every task.
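If you want to write these defaults down somewhere versionable, a hypothetical configuration might look like the snippet below. All field names and the specific constants for swarm size and repetition are illustrative; Olmix does not necessarily expose this exact schema.

```python
# Hypothetical mixing defaults; field names and constants are illustrative.
MIXING_DEFAULTS = {
    "proxy_model_params": 15_000_000,    # proxies >= ~15M kept rank correlation high
    "swarm_runs_per_domain": 8,          # scale the swarm linearly with domain count
    "regressor": "log_linear",           # robust default across swarm sizes
    "max_epochs_per_domain": 4.0,        # cap repetition of small or rare domains
    "reuse_strategy": "partial",         # "full" when changes are isolated
    "eval_metrics": ["bpb", "downstream_tasks"],  # track both; they need not agree
}
```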
Final reflection
Olmix isn’t just another academic algorithm: it’s a toolbox designed for the real flow of LM development. If you recognize the pain of recomputing mixes every iteration, or of trusting proxies so small they give noisy signals, Olmix offers practical rules and update techniques that scale with your needs.
Data mixing is a first-order lever in model quality. Olmix reminds us that with good experimental choices and a bit of engineering you can turn that messy art into a repeatable, efficient flow.
