After pretraining, language models go through several mid- and post-training stages to become useful: following instructions, reasoning, using tools, and staying safe. What happens when you want to add or improve a skill after all that work? Do you rerun the whole pipeline and wait weeks, or fine-tune on top and risk the model forgetting what it already knows?
BAR offers another route: train experts separately and combine them with a mixture-of-experts (MoE) architecture, so you keep flexibility and avoid regressions in existing skills.
What BAR is and why it matters
BAR stands for Branch-Adapt-Route. The key idea is to modularize post-training: each domain (for example math, code, tool use, safety) is trained as an independent expert that goes through its own mini pipeline. Then you combine them into a single model using a mixture-of-experts and a router that decides which expert to activate for each input.
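The combine-and-route step can be sketched in a few lines. This is a toy illustration, not BAR's actual implementation: the experts are stand-in functions (hypothetical names), and the learned gating network is replaced by a simple keyword-scoring router so the example runs self-contained.

```python
# Toy sketch of Branch-Adapt-Route: independently built domain experts
# (here, plain functions) plus a router that picks which one handles
# each input. All names are illustrative.

def math_expert(prompt):
    return f"[math] solving: {prompt}"

def code_expert(prompt):
    return f"[code] writing code for: {prompt}"

def safety_expert(prompt):
    return f"[safety] reviewing: {prompt}"

# Adding a new domain later means training one new expert and
# registering it here -- the others stay untouched.
EXPERTS = {
    "math": math_expert,
    "code": code_expert,
    "safety": safety_expert,
}

def route(prompt):
    """Toy router: score each domain by keyword overlap and dispatch
    to the best-scoring expert. A real MoE router would be a learned
    gating network operating on token representations."""
    keywords = {
        "math": {"integral", "prove", "equation"},
        "code": {"function", "bug", "compile"},
        "safety": {"harmful", "refuse", "policy"},
    }
    tokens = set(prompt.lower().split())
    scores = {name: len(tokens & kws) for name, kws in keywords.items()}
    best = max(scores, key=scores.get)
    return EXPERTS[best](prompt)

print(route("fix the bug in this function"))  # dispatched to the code expert
```

The dictionary of experts is the point: updating or adding a domain touches one entry, which is what lets the approach avoid re-running the whole post-training pipeline.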
Why does this matter to you? In real model development you have different teams, different timelines, and datasets that arrive asynchronously. Re-running the whole pipeline every time you update code or data is expensive and impractical. With BAR, updating a domain means retraining only that domain's expert, so the cost scales with the size of the update rather than the full pipeline.
