Which tokens does a hybrid AI model predict better?

Hybrid models are gaining ground, but what do they actually do better: remember exact words or follow the semantic thread of a text? Ai2 compared its strongest transformer (Olmo 3, 7B) with Olmo Hybrid to answer exactly that, token by token.

Experiment summary

Picture two models trained on the same data, with the same tokenizer and the same training recipe, but with different architectures. Any behavioral differences between them most likely come from the architecture itself. Ai2 put Olmo 3 (transformer) and Olmo Hybrid head-to-head on a battery of texts (articles, Wikipedia entries, books, papers and structured text such as Python, HTML and LaTeX) and measured, for each token, which model assigned higher probability to the actual next token.

The key measure is the loss gap: the transformer's loss minus the hybrid's loss on each token. If the gap is positive, the hybrid predicted that token better; if it is negative, the transformer wins.
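To make the sign convention concrete, here is a minimal sketch of a per-token loss gap. It assumes the loss is the usual negative log-likelihood of the actual next token, and that the gap is the transformer's loss minus the hybrid's loss (which matches "positive means the hybrid predicts better"). The function names and the probabilities are made up for illustration; they are not Ai2's code or data.

```python
import math

def token_losses(probs):
    """Per-token negative log-likelihood (in nats), given the probability
    a model assigned to the actual next token at each position."""
    return [-math.log(p) for p in probs]

def loss_gap(transformer_probs, hybrid_probs):
    """Per-token loss gap: transformer loss minus hybrid loss.
    Positive means the hybrid predicted that token better."""
    t_losses = token_losses(transformer_probs)
    h_losses = token_losses(hybrid_probs)
    return [lt - lh for lt, lh in zip(t_losses, h_losses)]

# Hypothetical probabilities for four tokens (illustration only).
tokens = ["def", " main", "(", "):"]
p_transformer = [0.50, 0.10, 0.90, 0.80]
p_hybrid = [0.50, 0.20, 0.90, 0.40]

gaps = loss_gap(p_transformer, p_hybrid)
for tok, gap in zip(tokens, gaps):
    winner = "hybrid" if gap > 0 else "transformer" if gap < 0 else "tie"
    print(f"{tok!r:10} gap={gap:+.3f} -> {winner}")
```

Because the gap is computed per token rather than averaged over a whole text, tokens can be grouped afterwards (by type, position or source) to see where each architecture has the edge.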

Experiment summary

Attention versus recurrence, and how to tell their influence apart

Which tokens favor each architecture

Filtered evaluations: a more sensitive way to compare architectures

Practical implications and recommendations

Where this is headed

Original source
