Fluid Benchmarking improves evaluation of language models


Imagine giving the same test to an elementary school student and to a university student. Does that make sense? Probably not.

The same thing happens when we use a static evaluation set for language models with very different capabilities. Researchers at the Allen Institute for AI propose Fluid Benchmarking, an adaptive approach that selects evaluation items to match the model's ability level, measuring it more precisely and at lower cost. (allenai.org)

What is Fluid Benchmarking

Fluid Benchmarking adapts ideas from psychometrics, in particular item response theory (IRT), to the evaluation of language models. Instead of treating every question the same, the method learns two characteristics for each item, its difficulty and its discrimination, and represents each model by a latent ability level.

This lets you compare models in an ability space instead of just looking at percent correct. (ar5iv.org)
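
To make this concrete, here is a minimal sketch of a two-parameter logistic (2PL) IRT model, the kind of item model implied by "difficulty and discrimination" above. The function name and the example numbers are illustrative, not taken from the paper.

```python
import numpy as np

def p_correct(ability, difficulty, discrimination):
    """2PL item response model: probability that a model with the given
    latent ability answers this item correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

# Illustrative numbers: a hard, highly discriminating item separates a
# weaker model (ability -1.0) from a stronger one (ability +1.5).
for ability in (-1.0, 1.5):
    print(ability, round(p_correct(ability, difficulty=1.0, discrimination=2.0), 3))
```

In this view, an item's discrimination controls how sharply it separates models whose abilities sit near its difficulty, which is exactly the information that percent correct throws away.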

How it works in plain terms

The evaluation starts with a question of average difficulty. Depending on whether the model gets it right or wrong, Fluid Benchmarking updates the estimate of its ability and picks the next question that will be most informative for that level.

To decide which item is most useful at each step, it uses Fisher information; once the item budget is exhausted, the final estimate is obtained with standard statistical estimators such as maximum likelihood or maximum a posteriori (MAP). The process is similar to adaptive exams used in education, but applied to AI models. (allenai.org)
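
As a rough illustration of that loop, here is a generic adaptive-testing sketch in the same 2PL setting: pick the unused item with the highest Fisher information at the current ability estimate, score the model's answer, and re-estimate ability with a MAP step. The names (`fluid_eval`, `answer_fn`) and the standard-normal prior are assumptions made for this sketch, not the official implementation from the paper or repository.

```python
import numpy as np

def p_correct(ability, a, b):
    # 2PL response probability: a = discrimination, b = difficulty
    return 1.0 / (1.0 + np.exp(-a * (ability - b)))

def fisher_information(ability, a, b):
    # Fisher information of a 2PL item at a given ability
    p = p_correct(ability, a, b)
    return a ** 2 * p * (1.0 - p)

def map_ability(responses, grid=np.linspace(-4, 4, 801)):
    # MAP estimate of ability on a grid, assuming a standard-normal prior.
    # `responses` is a list of (a, b, correct) tuples.
    log_post = -0.5 * grid ** 2  # log of the N(0, 1) prior, up to a constant
    for a, b, correct in responses:
        p = p_correct(grid, a, b)
        log_post += np.log(p if correct else 1.0 - p)
    return grid[np.argmax(log_post)]

def fluid_eval(items, answer_fn, budget=20):
    # items: dicts with 'a' (discrimination), 'b' (difficulty) and the question;
    # answer_fn(item) -> bool runs the model on one item and scores it.
    responses, remaining, ability = [], list(items), 0.0
    for _ in range(min(budget, len(items))):
        # Next item = the one that is most informative at the current ability.
        item = max(remaining, key=lambda it: fisher_information(ability, it["a"], it["b"]))
        remaining.remove(item)
        responses.append((item["a"], item["b"], answer_fn(item)))
        ability = map_ability(responses)
    return ability
```

The loop starts at an ability of zero, which mirrors the "question of average difficulty" starting point described above, and every answer sharpens the estimate used to choose the next item.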

Key results

  • On standard benchmarks, Fluid Benchmarking improves external validity and reduces evaluation variance. On MMLU, for example, it achieves higher validity and lower variance using up to fifty times fewer items than traditional evaluation. (ar5iv.org)

  • The method automatically avoids mislabeled items: the authors report a relative reduction in mislabeled items of nearly 99 percent, which boosts the reliability of results. (allenai.org)

  • During pretraining, Fluid Benchmarking adapts difficulty as the model progresses, reduces step-by-step variability, and produces smoother learning curves. That helps detect real improvements in later stages where classical accuracy saturates. (allenai.org)

Why it matters to you (researcher, founder, or curious reader)

  • Cheaper, more informative evaluations: if you train models, you can spend far less on tests while keeping or improving measurement quality.

  • Cleaner training signals: for teams monitoring checkpoints, less noise means faster decisions and better iterations.

  • Fairer leaderboards and comparisons: measuring in ability space reduces distortions from uninformative or incorrectly labeled items.

  • Reproducibility and tooling: the approach comes with code and data so you can replicate experiments or adapt them to your own benchmarks. A concrete example: the fluid-benchmarking repository includes utilities to fit IRT models and run the adaptive evaluation, with demo notebooks that let you try it in a few steps. (github.com)

Fluid Benchmarking proposes that evaluation shouldn't be a single, identical test for everyone, but a conversation with the model where questions change based on what we already know about its level. (ar5iv.org)

So what now?

Fluid Benchmarking isn't just a theoretical idea: it comes with implementations and reproducible results. If you work with model evaluations, it may be worth trying: you can spend less, get more stable measurements, and better understand what your models are actually learning.
