Bolmo arrives as a clear bet: take the power of Olmo 3 and turn it into models that operate directly on UTF-8 bytes. Why does this matter? Because working at the byte level can solve real problems (spelling, rare words, multiple languages and edge cases) without giving up the performance that the best subword models have earned.
What is Bolmo and why does it matter?
Bolmo is a family of byte-level models (Bolmo 7B and Bolmo 1B) that doesn't start from scratch. Instead, it byteifies open Olmo 3 checkpoints, reusing their backbone and capabilities and adapting them to an architecture that groups bytes into variable-length patches. The result: open models that, according to Ai2, match and sometimes outperform subword models on many tasks, and shine where byte-level detail matters most.
Why prefer bytes over subwords? Subword models depend on a fixed vocabulary: they work well, but they break down in predictable edge cases. Byte-level models avoid that hand-built vocabulary and handle spelling errors, typographic oddities and multilingual text better. Wouldn't that make things smoother when you paste weird names into a prompt or mix languages in a single message?
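To make "operating directly on UTF-8 bytes" concrete, here is a tiny plain-Python illustration (not Bolmo code): any text, whatever the language or the typos, reduces to the same fixed set of 256 byte values, so nothing is ever out of vocabulary.

```python
# Every string, however unusual, maps to integers in 0..255: the byte "vocabulary"
# is fixed at 256 symbols, so misspellings and mixed scripts never fall outside it.
for text in ["naïve spellling", "日本語 meets English", "Olmo → Bolmo"]:
    byte_ids = list(text.encode("utf-8"))
    print(f"{text!r}: {len(byte_ids)} bytes, first few: {byte_ids[:6]}")
```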
How Bolmo works (technical architecture)
Bolmo is a latent tokenizer language model. At a high level it operates in three stages:
Each UTF-8 byte is embedded and passed through a lightweight local encoder (a stack of mLSTM layers) that produces byte-level contextual representations.
A non-causal boundary predictor decides patch boundaries using a bit of future context, and groups bytes into variable-length patches that get pooled.
The patches are fed to the global transformer (the original Olmo 3 backbone); its outputs are then unpooled back to byte positions, and a local decoder refines the prediction of the next byte and the next boundary.
This strategy combines the best of both worlds: sensitivity to fine-grained text structure (bytes) and the strength of a large, already-trained transformer. Technically, Bolmo is in the same family as models like DTP, BLT and H-Net, but modified to reuse strong subword backbones.
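To ground this, below is a heavily simplified PyTorch sketch of the three-stage pipeline. It is not Ai2's implementation: module names and sizes are invented, a plain LSTM stands in for the mLSTM blocks, boundaries are simply thresholded, and patches are mean-pooled, all as illustrative assumptions. The point is only to show where the reused backbone sits between the byte-level encoder and decoder.

```python
import torch
import torch.nn as nn

class ByteifiedLM(nn.Module):
    """Toy sketch: bytes -> local encoder -> patches -> global transformer -> bytes."""

    def __init__(self, d_local=256, d_global=512):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_local)      # one vector per possible byte
        # Stand-in for the stack of mLSTM blocks (a plain LSTM, purely for illustration).
        self.local_encoder = nn.LSTM(d_local, d_local, num_layers=2, batch_first=True)
        # Non-causal boundary predictor: the width-3 convolution peeks one byte ahead.
        self.boundary_head = nn.Sequential(
            nn.Conv1d(d_local, d_local, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(d_local, 1, kernel_size=1),
        )
        self.to_global = nn.Linear(d_local, d_global)
        # Stand-in for the reused Olmo 3 backbone (just two generic transformer layers here).
        self.global_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_global, nhead=8, batch_first=True), num_layers=2)
        self.from_global = nn.Linear(d_global, d_local)
        self.local_decoder = nn.LSTM(d_local, d_local, num_layers=2, batch_first=True)
        self.next_byte_head = nn.Linear(d_local, 256)      # predicts the next byte (0..255)

    def forward(self, byte_ids):                           # byte_ids: (1, T) integers in 0..255
        # Stage 1: byte-level contextual representations from the local encoder.
        h, _ = self.local_encoder(self.byte_embed(byte_ids))                   # (1, T, d_local)
        # Stage 2: predict boundaries and pool bytes into variable-length patches.
        boundary_logits = self.boundary_head(h.transpose(1, 2)).squeeze(1)     # (1, T)
        patch_id = (boundary_logits > 0).cumsum(dim=-1)    # hard boundaries, for the sketch only
        n_patches = int(patch_id.max()) + 1
        counts = torch.bincount(patch_id[0], minlength=n_patches).clamp(min=1)
        pooled = torch.zeros(1, n_patches, h.size(-1)).index_add_(1, patch_id[0], h)
        pooled = pooled / counts.view(1, -1, 1)            # mean-pool bytes within each patch
        # Stage 3: run the global transformer on patches, un-pool, refine with the local decoder.
        g = self.global_transformer(self.to_global(pooled))
        per_byte = self.from_global(g)[:, patch_id[0], :]  # each byte reads its patch's state
        refined, _ = self.local_decoder(per_byte + h)
        return self.next_byte_head(refined), boundary_logits

model = ByteifiedLM()
logits, boundaries = model(torch.tensor([list("byte-level models".encode("utf-8"))]))
```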
Byteifying Olmo 3: two-stage training
Training byte-level models from scratch is expensive. Bolmo avoids that cost with a two-stage plan (a minimal code sketch follows the list):
Freeze the Olmo 3 transformer and train only the local encoder, local decoder, boundary predictor and the language head. This stage uses 9.8B tokens (≈43B bytes) and is relatively cheap.
Unfreeze the whole model and continue training for an additional 39.3B tokens (≈173B bytes) so Bolmo can fully exploit byte-level information.
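As a rough illustration of this two-stage schedule, the sketch below freezes everything except the new byte-level components and then unfreezes the whole model. It reuses the toy ByteifiedLM from the architecture sketch above; the submodule names, learning rates and optimizer choice are assumptions, not Ai2's recipe.

```python
import torch

# Toy model from the architecture sketch above; the stage boundaries and token
# counts come from the text, everything else here is an illustrative assumption.
model = ByteifiedLM()

NEW_PARTS = ("byte_embed", "local_encoder", "boundary_head",
             "to_global", "from_global", "local_decoder", "next_byte_head")

def set_stage(model, stage):
    """Stage 1: train only the new byte-level modules around a frozen backbone.
    Stage 2: unfreeze everything for full fine-tuning."""
    for name, param in model.named_parameters():
        param.requires_grad = (stage == 2) or name.startswith(NEW_PARTS)

# Stage 1: ~9.8B tokens (~43B bytes) with the Olmo 3 backbone frozen.
set_stage(model, stage=1)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)  # lr is illustrative
# ... relatively cheap training run ...

# Stage 2: ~39.3B more tokens (~173B bytes) with everything trainable.
set_stage(model, stage=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lr is illustrative
# ... continued training ...
```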
The key idea: you don't waste the investment in curated data, architecture or long-context training. You extend that work into byte space with a moderate additional cost.
Performance: where Bolmo stands out
Ai2 evaluated Bolmo on a broad suite (math, STEM reasoning, QA, code, general knowledge) and on character-focused tests like CUTE and EXECUTE. Highlights:
Bolmo 7B approaches Olmo 3 7B on most general tasks, and beats it by a wide margin on character-centered benchmarks. On the character aggregate, it improves by about 20 points over Olmo 3.
Compared to other similar byte-level models (BLT 7B, TFree-Hat 7B, EvaByte 6.5B), Bolmo 7B is the most robust on code, math, multiple-choice QA and character-level understanding, with a small exception in GenQA.
Bolmo 1B, byteified from Olmo 2 1B, competes well with previous byte-level models at that scale and improves character understanding relative to its subword base.
This confirms the hypothesis: you can get the best of bytes (fine detail) without paying a large toll in overall performance.
Inference and practical efficiency
A common fear is latency: decoding byte by byte means more positions per response, which can mean more time. Bolmo tackles this with local mLSTMs and dynamic pooling. Indicative measurements:
Wall-clock decoding: ~125 bytes per second, versus ~150 bytes per second for the corresponding subword model at the same compression level.
You can speed this up by increasing the bytes-per-patch ratio.
An important advantage: compression is an adjustable control. Subword models can only raise compression by enlarging their vocabulary, which eventually runs into the softmax bottleneck. Bolmo can instead raise the average bytes per patch without hitting that limit, which pushes out the Pareto frontier between performance and compute cost; the back-of-the-envelope calculation below illustrates the effect.
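Here is why bytes per patch acts as a compression dial: the expensive part is the number of positions the global transformer processes, and raising the average patch length shrinks that count while the byte-level output head stays at 256 classes. The figures below are made up for illustration, not measurements.

```python
# Number of positions the (expensive) global transformer handles for one document,
# at different average patch sizes. All figures are illustrative, not measured.
doc_bytes = 10_000

for bytes_per_patch in (4.0, 6.0, 8.0):
    global_positions = doc_bytes / bytes_per_patch
    print(f"{bytes_per_patch:.0f} bytes/patch -> {global_positions:,.0f} global positions")

# A subword model can only reach higher compression by growing its vocabulary
# (and its softmax); the byte-level output head stays at 256 classes regardless.
```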
Practical advantage: zero-cost upgrades
A powerful feature of byteifying is compatibility with the post-training ecosystem. Ai2 shows that, after byteifying, you can import improvements from post-trained checkpoints via weight merging without retraining in byte-space.
Concrete example: an Olmo 3 checkpoint post-trained to follow instructions, when brought over via arithmetic weight merging, greatly improves Bolmo on IFEval. Key numbers:
Bolmo base on IFEval: 31.1%, versus 35.4% for the original subword model.
After applying the weight merge, Bolmo rises to 67.4%, roughly matching the post-trained Olmo 3 at 66.9%.
That suggests an efficient workflow: byteify your strong model and reuse RL runs, fine-tunes and adapters through lightweight weight merges (sketched in code below), instead of redoing everything in bytes.
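For intuition, here is a sketch of the kind of arithmetic weight merging described above, in the spirit of task-arithmetic merging: add the delta between the post-trained and base subword checkpoints to the backbone tensors that the byteified model shares with them. The function name, the merge coefficient and the assumption that shared tensors match by name and shape are illustrative; the actual recipe is in Ai2's report.

```python
import torch

def merge_posttraining(bolmo_state, olmo_base_state, olmo_post_state, alpha=1.0):
    """Add the post-training delta (post-trained minus base) to every tensor the
    byteified model shares with the subword backbone, matched by name and shape.
    Purely illustrative; alpha and the matching rule are assumptions."""
    merged = dict(bolmo_state)
    for name, base_w in olmo_base_state.items():
        if name in merged and name in olmo_post_state and merged[name].shape == base_w.shape:
            merged[name] = merged[name] + alpha * (olmo_post_state[name] - base_w)
    return merged

# Usage sketch (paths and variable names are placeholders, not real artifacts):
# merged = merge_posttraining(bolmo.state_dict(),
#                             torch.load("olmo3-base.pt"),
#                             torch.load("olmo3-instruct.pt"))
# bolmo.load_state_dict(merged)
```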
Important: this compatibility isn't guaranteed across all model families. It works well here because Olmo 3's embeddings can be "reset" without losing performance, a behavior that deserves more study.
What’s next and how to try Bolmo
Ai2 proposes several directions: experiment with richer boundary predictors, scale the process to larger models, create multilingual variants and let subword ecosystem improvements continue flowing into byte space.
If you want to try it now, Ai2 publishes checkpoints, code, data and the technical report to reproduce and extend the byteifying process. That makes it easy for researchers and developers to inspect internals, reproduce results and build byte-level systems on top of Olmo.
Think of practical applications: scientific assistants that fix notation, code editors that don't break on unusual names, or multilingual models that don't rely on biased vocabularies. Doesn't that sound useful for many real-world cases?