Which tokens does a hybrid AI model predict best | Keryc
In recent years we've seen hybrid models that mix attention layers with recurrent layers. So what do they gain and what do they lose compared to a pure transformer? A comparison of Olmo Hybrid and Olmo 3 was designed to answer that exact question, token by token.
Experiment and method
The idea was simple and elegant: compare two models that are as similar as possible except for their architecture. Olmo 3 is a 7B transformer, Olmo Hybrid is its hybrid version, and both use the same data, tokenizer and training recipe. That means any difference in their predictions mostly reflects the architecture.
To measure this, the researchers ran a variety of texts through both models: articles, Wikipedia entries, books, scientific papers and structured text like Python, HTML and LaTeX. Each model assigned a probability to the actual next token at every position, and that probability was recorded. Comparing the models token by token yields the loss gap: the transformer's loss minus the hybrid's loss on the same token. A positive gap favors the hybrid; a negative one favors the transformer.
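As a rough illustration of the loss gap, here is a minimal Python sketch. It assumes the per-token loss is the negative natural log of the probability given to the true next token, and that the gap is taken as transformer loss minus hybrid loss, which matches the sign convention above. The function names and the probabilities are made up for the example.

```python
import math

def token_losses(probs):
    """Per-token loss: negative log of the probability given to the true next token."""
    return [-math.log(p) for p in probs]

def loss_gap(transformer_probs, hybrid_probs):
    """Per-token gap = transformer loss - hybrid loss (positive favors the hybrid)."""
    t = token_losses(transformer_probs)
    h = token_losses(hybrid_probs)
    return [lt - lh for lt, lh in zip(t, h)]

# Hypothetical probabilities assigned to the true next token at three positions.
transformer_probs = [0.50, 0.10, 0.90]
hybrid_probs = [0.50, 0.20, 0.45]

gaps = loss_gap(transformer_probs, hybrid_probs)
for position, gap in enumerate(gaps):
    print(f"position {position}: gap = {gap:+.3f}")
```

In this toy input the models tie on the first token, the hybrid wins the second (it gave the true token twice the probability) and the transformer wins the third by the same margin.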
How to isolate fine-grained effects
A simple average isn't enough. Some token categories are rare and others repeat a lot, which skews a plain average. So the researchers did two things: 1) group tokens by category and average the loss gap, and 2) run regressions that control for factors like frequency and repetition. That way, effects genuinely attributable to architecture stand out.
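The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the features (a content-word indicator, a repetition indicator and a log frequency), the helper names and every number are invented here, and the actual study may use different covariates and a different regression setup. The gaps are generated from known coefficients so you can check that the regression recovers them.

```python
from collections import defaultdict

def mean_gap_by_category(records):
    """Step 1: average the loss gap within each token category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        sums[r["category"]] += r["gap"]
        counts[r["category"]] += 1
    return {c: sums[c] / counts[c] for c in sums}

def ols(X, y):
    """Step 2: ordinary least squares via the normal equations,
    solved with Gauss-Jordan elimination (fine for a handful of features)."""
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic tokens: (category, is_repeat, log_frequency). All values are invented.
tokens = [("content", 0, 2.0), ("content", 1, 3.0), ("function", 0, 8.0),
          ("function", 1, 7.0), ("content", 0, 4.0), ("function", 0, 9.0)]

# Invented coefficients used to generate the gaps, so the regression can be checked.
TRUE = {"intercept": 0.1, "content": 0.3, "repeat": -0.2, "log_freq": 0.05}

records, X, y = [], [], []
for category, is_repeat, log_freq in tokens:
    is_content = 1.0 if category == "content" else 0.0
    gap = (TRUE["intercept"] + TRUE["content"] * is_content
           + TRUE["repeat"] * is_repeat + TRUE["log_freq"] * log_freq)
    records.append({"category": category, "gap": gap})
    X.append([1.0, is_content, float(is_repeat), log_freq])
    y.append(gap)

raw_means = mean_gap_by_category(records)
coefs = ols(X, y)  # order: intercept, content, repeat, log_freq
print(raw_means)
print([round(c, 3) for c in coefs])
```

In this toy data the raw category means differ by only 0.05, while the regression recovers the content effect of 0.3 that was built in. Frequency and repetition are entangled with category, which is exactly why the plain average misleads.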
Which types of tokens the hybrid favors
The central finding is clear: the hybrid has a real edge in many token classes, but not in all.
The hybrid stands out on content tokens: nouns, verbs, adjectives and adverbs. In other words, the words that carry meaning about what the sentence is about.
It also shows strength on tokens that require following the flow of the text, for example resolving pronouns and other references, where the model needs to track state.
By contrast, its advantage shrinks a lot on function words like "the", "of" or "is", which are almost fully determined by syntax.
A notable pattern: when the next token is a literal copy of something that already appeared earlier in the same passage, the hybrid's advantage almost disappears. On repeated n-grams, exact copying strongly favors the transformer, and the longer the repeated span, the smaller the hybrid's advantage.
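To make "repeated n-gram" concrete, here is one possible way to tag each token with the length of the repetition it completes. This is a sketch, not the authors' procedure: the definition (longest n-gram ending at this position that also ends at some earlier position, overlaps allowed), the quadratic-time loop and the function name are all assumptions made for illustration.

```python
def repeat_lengths(tokens):
    """For each position i, return the length of the longest n-gram ending at i
    that also ends at some earlier position in the same sequence (0 if none)."""
    lengths = []
    for i in range(len(tokens)):
        best = 0
        for j in range(i):  # j is a candidate earlier end position
            n = 0
            # Walk backwards from both end positions while the tokens agree.
            while j - n >= 0 and tokens[i - n] == tokens[j - n]:
                n += 1
            best = max(best, n)
        lengths.append(best)
    return lengths

passage = "the cat sat on the mat and the cat sat".split()
lengths = repeat_lengths(passage)
for token, length in zip(passage, lengths):
    print(f"{token:>4}  repeat length {length}")
```

With these lengths in hand, you could average the loss gap separately for lengths 0, 1, 2 and so on to see whether the gap shrinks as the repeated span grows.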
Why these differences occur: attention vs recurrent
Think of each layer as a reader that refines the representation of each word using context.
In a transformer, attention lets you look directly at any previous token and weigh its relevance. That's excellent for retrieving an exact word that appeared many positions ago. The problem is that attention's cost grows with context length, and it isn't naturally the best tool for carrying a sequential state that evolves.
In a recurrent layer, the model reads left to right and accumulates a fixed-size memory. That memory is compressed and somewhat lossy, so it can't retrieve an exact copy as well as attention. But it's very good at keeping track of state or how information changes as you go, which helps predict words related to the ongoing meaning.
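The contrast can be felt in a toy Python sketch. It is deliberately generic: the softmax read over stored keys and values and the decaying running sum below are textbook simplifications chosen for illustration, not the layers Olmo 3 or Olmo Hybrid actually use, and the names, `scale` and `decay` values are assumptions.

```python
import math

def attention_read(query, keys, values, scale=10.0):
    """Softmax-weighted read over ALL stored (key, value) pairs.
    Storage grows with context length, but an exact match is recoverable."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def recurrent_state(inputs, decay=0.5):
    """Left-to-right scan into a fixed-size state: old inputs fade and blend."""
    state = [0.0] * len(inputs[0])
    for x in inputs:
        state = [decay * s + xi for s, xi in zip(state, x)]
    return state

# Three past tokens, each with a distinct one-hot key and a 2-d value.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Attention: querying with the first key returns (almost exactly) the first value.
read = attention_read(keys[0], keys, values)

# Recurrence: the same three values are folded into one 2-d summary.
state = recurrent_state(values)
print([round(r, 3) for r in read], state)
```

The attention read gets the first value back nearly untouched because it still has every past entry to look at. The recurrent state stays two numbers wide no matter how many tokens went in, so the first value is only present as a faded contribution to the blend.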
That complementarity explains why a hybrid, combining a few attention layers with recurrent ones, can get the best of both worlds.
Evaluation by token type: practical utility
Inspired by the results, the authors propose using losses filtered by token type as a fine-grained pretraining evaluation method. They tested three 1B models: pure transformer, hybrid and pure recurrent. The findings:
On meaning-bearing tokens that aren't repetitions, the hybrid and the pure recurrent beat the transformer, with the hybrid being the best.
On tokens that require copying verbatim (repeated), the pure recurrent falls behind because it lacks attention, and the transformer usually dominates.
This shows a global loss metric is too blunt for comparing architectures. Losses filtered by token type reveal differences early in training, such as in copying ability and state tracking.
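Here is a minimal sketch of what a filtered loss could look like in Python. Every number is invented purely to illustrate the idea, and the two subsets (non-repeated content tokens, repeated tokens) and the helper name are assumptions; how the authors define and tag their subsets may differ.

```python
def filtered_loss(losses, mask):
    """Mean loss over the positions where mask is True."""
    kept = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(kept) / len(kept) if kept else float("nan")

# Hypothetical per-token tags for a six-token passage.
is_content = [True, True, False, True, False, True]
is_repeat = [False, False, False, True, True, True]

# Invented per-token losses for three hypothetical models.
losses = {
    "transformer": [3.0, 3.2, 1.0, 0.4, 0.2, 0.6],
    "hybrid": [2.6, 2.8, 1.0, 0.8, 0.4, 0.8],
    "recurrent": [2.7, 2.9, 1.0, 1.5, 0.9, 1.4],
}

subsets = {
    "all tokens": [True] * len(is_content),
    "content, not repeated": [c and not r for c, r in zip(is_content, is_repeat)],
    "repeated": is_repeat,
}

report = {
    model: {name: filtered_loss(per_token, mask) for name, mask in subsets.items()}
    for model, per_token in losses.items()
}
for model, row in report.items():
    print(model, {name: round(value, 3) for name, value in row.items()})
```

In this made-up example the transformer and the hybrid have the same loss over all tokens, yet the filtered view separates them: one is ahead on non-repeated content tokens, the other on repeated ones.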
What to learn and what's next
Two practical lessons:
Don't look only at average loss. If you want to understand why a design works better, inspect loss on relevant token subsets.
Hybrids seem especially good on words that convey meaning and on state tracking, probably thanks to the recurrent layers.
To design better hybrids you need to understand, token by token, what each component contributes. That will enable more efficient and accurate architectures tailored to the task: coherent long-range generation, exact text copying, or resolving complex references.
If you're curious to experiment, the authors invite you to try Olmo 3, Olmo Hybrid and the open artifacts to play with these ideas. Want to evaluate a model on your own corpus? Filtering losses by token type is a powerful, practical tool to get clear answers.