Which tokens does a hybrid AI model predict better? | Keryc
Hybrid models are gaining ground, but what do they actually do better: remember exact words or follow the semantic thread of a text? Ai2 compared its strongest transformer (Olmo 3, 7B) with Olmo Hybrid to answer exactly that, token by token.
Summary of the experiment
Think of it like this: two models built with the same data, the same tokenizer and the same training recipe, but with different architectures. The behavioral differences you see between them likely come from the architecture itself. Ai2 put Olmo 3 (transformer) and Olmo Hybrid head-to-head on a battery of texts — articles, Wikipedia entries, books, papers and structured text like Python, HTML and LaTeX — and measured, for each token, which model gave higher probability to the actual next token.
The key measure is the loss gap: the transformer's loss minus the hybrid's loss on each token. If the loss gap is positive, the hybrid predicts that token better; if it's negative, the transformer wins.
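As a minimal sketch, the loss gap can be computed from the probability each model assigned to the actual next token. The gap is taken here as the transformer's loss minus the hybrid's, which is the sign convention implied by "positive means the hybrid predicts better". The function name and the sample probabilities are invented for illustration; they are not Ai2's code or data.

```python
import math

def loss_gap(p_transformer, p_hybrid):
    """Per-token loss gap: transformer loss minus hybrid loss.

    Each argument is a list of probabilities that one model assigned to
    the actual next token. Loss is the negative log-probability, so a
    positive gap means the hybrid predicted that token better.
    """
    if len(p_transformer) != len(p_hybrid):
        raise ValueError("both models must be scored on the same tokens")
    return [
        -math.log(pt) - (-math.log(ph))
        for pt, ph in zip(p_transformer, p_hybrid)
    ]

# Hypothetical probabilities for three tokens (illustrative only).
gaps = loss_gap([0.50, 0.20, 0.90], [0.60, 0.20, 0.80])
for i, g in enumerate(gaps):
    winner = "hybrid" if g > 0 else "transformer" if g < 0 else "tie"
    print(f"token {i}: gap={g:+.3f} ({winner})")
```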
Attention versus recurrence, and how to tell their influence apart
A transformer applies attention in every layer: it can look at any previous token directly and decide how much it should influence the prediction. That’s great for copying something that appeared far back in the context, but the computational cost grows with the square of the input length. Attention is also not well suited to representing how a state changes as the text progresses.
A hybrid model keeps some attention layers but replaces others with recurrent layers. Recurrent layers read left to right and maintain a fixed-size memory: processing long inputs doesn’t raise the cost per token. That memory is compressed and lossy, so it’s not great at retrieving exact copies from far away, but it’s good at tracking and updating states that evolve (for example, who the subject is in a story, or the value of a variable in code).
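The cost difference between the two layer types can be illustrated with a toy operation count. This is only a sketch under simplifying assumptions: one abstract "operation" per token an attention step looks at, one per slot of a fixed-size recurrent memory, and no constants, layer widths or hardware effects. All function names are hypothetical.

```python
def attention_token_ops(position):
    """Toy cost of one attention step: look at every token so far."""
    return position  # 1-indexed position, includes the current token

def recurrent_token_ops(position, state_size=1):
    """Toy cost of one recurrent step: update a fixed-size memory."""
    return state_size  # does not depend on position

def total_ops(per_token_ops, length):
    """Sum the per-token cost over a whole input of the given length."""
    return sum(per_token_ops(t) for t in range(1, length + 1))

for n in (1_000, 2_000, 4_000):
    attn = total_ops(attention_token_ops, n)
    rec = total_ops(recurrent_token_ops, n)
    print(f"length={n}: attention={attn} recurrent={rec}")
```

Doubling the input length roughly quadruples the attention total in this count, while the recurrent total only doubles.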
Why compare token by token? Because architectural advantages show up on specific types of predictions. Ai2 didn’t just average losses: they categorized tokens and used regressions to isolate the effect of each category while holding other variables constant (frequency, repetition, etc.). That avoids misleading conclusions from simple averages.
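To see why a regression can tell a different story than a simple average, here is a small sketch on synthetic data. Everything in it is an assumption made for illustration: the two binary features (`is_content`, `is_repeat`), the token counts and the effect sizes are invented, and the source does not describe Ai2's actual regression setup. The solver is written in plain Python to keep the example dependency-free.

```python
def ols(rows, targets):
    """Least-squares coefficients via the normal equations (X^T X) b = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    m = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

def mean(xs):
    return sum(xs) / len(xs)

# Synthetic tokens as (is_content, is_repeat, count). All values are invented.
cells = [(1, 0, 2), (1, 1, 6), (0, 0, 6), (0, 1, 2)]
rows, gaps = [], []
for is_content, is_repeat, count in cells:
    for _ in range(count):
        rows.append([1.0, float(is_content), float(is_repeat)])  # 1.0 = intercept
        gaps.append(0.02 + 0.02 * is_content - 0.03 * is_repeat)

# Naive comparison: average gap on content tokens minus function tokens.
naive = (mean([g for r, g in zip(rows, gaps) if r[1] == 1.0])
         - mean([g for r, g in zip(rows, gaps) if r[1] == 0.0]))
intercept, content_effect, repeat_effect = ols(rows, gaps)
print(f"naive content-vs-function difference: {naive:.4f}")
print(f"regression content effect:            {content_effect:.4f}")
print(f"regression repeat effect:             {repeat_effect:.4f}")
```

In this made-up data, content tokens happen to be repeated more often, so the naive difference (0.005) understates the content effect that the regression recovers (0.02). In practice a library routine such as NumPy's `numpy.linalg.lstsq` would replace the hand-written solver.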
Which tokens favor each architecture
The main findings are clear and reproducible:
The hybrid wins on meaningful tokens: nouns, verbs, adjectives and adverbs. The typical loss gap is around 0.04 for these content words and smaller (about 0.02) for function words like “the” or “of”. That suggests recurrent layers help track changing information and build stronger semantic representations.
The hybrid also outperforms the transformer on contextual resolution tasks, for example understanding which person a pronoun refers to. There, the ability to trace state sequentially seems to pay off.
Conversely, the hybrid’s advantage all but disappears when the next word is a literal repetition of something already in the text. Ai2 looked for repeated n-grams: the longer the repeated sequence, the smaller the hybrid’s advantage until it’s nearly zero. In those cases the transformer does better or ties, because attention can retrieve a distant token precisely.
One specific and consistent case: the transformer predicts closing braces or parentheses better in natural language, code and markup. The reason is that bracket pairing is a pattern attention can represent exactly, without recurrent help.
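The repeated n-gram analysis in the findings above needs a way to tell, for each token, how long a literal repetition it completes. The source does not say how Ai2 detected repeats, so the following is one possible approach, written as a simple quadratic-time sketch. The function name and the `max_n` cap are hypothetical.

```python
def repeat_lengths(tokens, max_n=8):
    """For each position, the length of the longest n-gram ending there
    that also ended at an earlier position (0 if the token is new)."""
    lengths = []
    for i in range(len(tokens)):
        best = 0
        for j in range(i):  # candidate earlier end position
            n = 0
            # Walk backwards from both end positions while tokens match.
            while n < max_n and j - n >= 0 and tokens[j - n] == tokens[i - n]:
                n += 1
            best = max(best, n)
        lengths.append(best)
    return lengths

tokens = "the cat sat on the mat and the cat sat".split()
for tok, n in zip(tokens, repeat_lengths(tokens)):
    print(f"{tok:>4}  repeat_len={n}")
```

Per-token loss gaps could then be grouped by this length to check whether the hybrid's advantage shrinks as the repeated sequence gets longer.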
Filtered evaluations: a more sensitive way to compare architectures
Ai2 proposed using filtered losses — that is, measuring loss only on tokens that test a specific skill — as a comparison metric in pretraining. They tried this with three 1B-parameter models: pure transformer, hybrid and pure recurrent (no attention):
On content tokens that are not repetitions, both the hybrid and the pure recurrent model beat the transformer, and the hybrid is the best of the three.
On tokens that are repetitions, the pure recurrent model lags because it has no attention to copy with, while the transformer and the hybrid perform better.
These filtered metrics reveal fine differences (copying vs. reasoning about states) much earlier than the global average loss does.
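A filtered loss of the kind described above is just an average restricted to the tokens that pass a filter. The sketch below assumes per-token losses and boolean token labels are already available; the function name, the labels and the numbers are hypothetical and only illustrate the mechanics.

```python
def filtered_loss(losses, keep):
    """Mean loss over the tokens where the matching keep flag is True."""
    kept = [loss for loss, k in zip(losses, keep) if k]
    if not kept:
        raise ValueError("filter selected no tokens")
    return sum(kept) / len(kept)

# Hypothetical per-token losses and labels for one model (illustrative only).
losses     = [2.0, 0.1, 3.0, 0.2, 1.0]
is_content = [True, False, True, False, True]
is_repeat  = [False, False, False, True, True]

# Filter for content tokens that are not repetitions.
content_non_repeat = [c and not r for c, r in zip(is_content, is_repeat)]

print("average loss:         ", sum(losses) / len(losses))
print("content, non-repeated:", filtered_loss(losses, content_non_repeat))
print("repeated:             ", filtered_loss(losses, is_repeat))
```

Running the same filters on each model's losses gives one number per skill per architecture, which can be compared directly instead of a single global average.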
Practical implications and recommendations
What do these results tell you if you work with models or design architectures?
Don’t rely only on average loss to compare architectures. If your application needs to track states or reason about changes (long summaries, dialogues with pronoun references, code analysis that requires following variables), an evaluation filtered on content tokens will give you a clearer signal.
If your task requires literal copying (code completion with many repeated tokens, templates, responses that reuse exact phrases), attention remains a key component.
For large-scale models, hybrids seem to offer a good balance: they keep reasonable copying ability (thanks to the attention layers they retain) and improve sequential tracking and representation thanks to recurrent layers, with friendlier compute costs on long contexts.
Where this is heading
The main lesson is methodological: measuring by token types gives a more useful X-ray of what each model component does well. Ai2 will use these ideas to iterate on more efficient hybrids and to better understand why certain layers help on concrete tasks.
A model is not just an average number. Knowing which tokens it favors lets you design architectures and metrics aligned with real utility.
If you’re interested in building or evaluating models for specific tasks, start by identifying the critical token types in your domain and measuring filtered losses — it can completely change which architecture you choose.