Imagine giving the same test to an elementary school student and to a university student. Does that make sense? Probably not.
The same mismatch occurs when we use a single static evaluation set for language models with very different capabilities. Researchers at the Allen Institute propose Fluid Benchmarking, an adaptive approach that selects items according to the model's ability level, measuring it more precisely and at lower cost. (allenai.org)
What is Fluid Benchmarking?
Fluid Benchmarking adapts ideas from psychometrics, in particular item response theory (IRT), to the evaluation of language models. Instead of treating every question the same, the method learns two characteristics for each item, its difficulty and its discrimination, and represents each model by a latent ability level.
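To make this concrete, here is a minimal sketch in Python of the standard two-parameter logistic (2PL) IRT model, where the probability of a correct answer depends on the gap between a model's ability and the item's difficulty, scaled by the item's discrimination. The item parameters and ability values below are made up for illustration; they are not taken from the paper.

```python
import math

def p_correct(ability: float, difficulty: float, discrimination: float) -> float:
    """2PL IRT model: probability that a model with the given latent ability
    answers an item correctly, given that item's difficulty and discrimination."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Hypothetical items as (difficulty, discrimination) pairs -- illustrative values only.
items = [(-1.0, 0.8), (0.0, 1.5), (2.0, 1.2)]

# Hypothetical latent abilities for a weaker and a stronger model.
weak_model, strong_model = -0.5, 1.5

for difficulty, discrimination in items:
    print(f"difficulty={difficulty:+.1f}  "
          f"weak={p_correct(weak_model, difficulty, discrimination):.2f}  "
          f"strong={p_correct(strong_model, difficulty, discrimination):.2f}")
```

In this formulation, items whose difficulty sits near a model's current ability estimate are the most informative about it, while highly discriminating items separate nearby ability levels more sharply; that is what allows an adaptive procedure to pick different questions for weak and strong models rather than giving everyone the same test.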
