Which tokens does a hybrid AI model predict best?

Recent years have brought hybrid models that mix attention layers with recurrent layers. What do they gain, and what do they lose, compared to a pure transformer? A token-by-token comparison of Olmo Hybrid and Olmo 3 was designed to answer exactly that question.

Experiment and method

The idea was simple and elegant: compare two models that are as similar as possible except for their architecture. Olmo 3 is a 7B transformer and Olmo Hybrid is its hybrid counterpart; both use the same data, tokenizer and training recipe. Any difference in their predictions therefore mostly reflects the architecture.

To measure this, the researchers ran a variety of texts through both models: articles, Wikipedia entries, books, scientific papers and structured text such as Python, HTML and LaTeX. At every position in a sequence, each model assigned a probability to the actual next token, and that probability was recorded. Comparing the two models token by token gives the loss gap, i.e., the transformer's loss on a token minus the hybrid's loss on the same token. A positive gap favors the hybrid; a negative one favors the transformer.
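To make the loss gap concrete, here is a minimal Python sketch. It assumes the per-token loss is the usual negative log-probability of the true next token, and that the gap is the transformer's loss minus the hybrid's, which matches the sign convention above. The function names, the token list and the probabilities are all made up for illustration; they are not from the original experiment.

```python
import math


def token_losses(probs):
    """Negative log-likelihood (in nats) of each true next token."""
    return [-math.log(p) for p in probs]


def loss_gaps(p_transformer, p_hybrid):
    """Per-token gap: transformer loss minus hybrid loss.

    Positive values mean the hybrid assigned the true token a higher
    probability; negative values mean the transformer did.
    """
    if len(p_transformer) != len(p_hybrid):
        raise ValueError("both models must score the same token sequence")
    losses_t = token_losses(p_transformer)
    losses_h = token_losses(p_hybrid)
    return [lt - lh for lt, lh in zip(losses_t, losses_h)]


# Hypothetical probabilities each model gave to the actual next token.
tokens = ["def", " add", "(", "a", ")"]
p_transformer = [0.50, 0.10, 0.90, 0.20, 0.80]
p_hybrid = [0.50, 0.20, 0.90, 0.10, 0.80]

gaps = loss_gaps(p_transformer, p_hybrid)
for tok, gap in zip(tokens, gaps):
    winner = "hybrid" if gap > 0 else "transformer" if gap < 0 else "tie"
    print(f"{tok!r:8} gap={gap:+.3f} -> {winner}")
```

Because the gap is computed per token rather than averaged over a whole document, the same list can later be grouped by token type to see where each architecture does better.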

How to isolate fine-grained effects

Which types of tokens the hybrid favors

Why these differences occur: attention vs recurrent

Evaluation by token type: practical utility

What to learn and what's next
