In short: Ai2 (Allen Institute for AI) published a new analysis about how to decide which benchmarks actually help when you compare language models. The main idea? Not every test is equally useful when you run small-scale experiments and then scale up, and there’s a simple way to measure that reliability. (allenai.org)
What Ai2 announced
Ai2 proposes measuring the relationship between signal and noise to decide which benchmarks are reliable for design and scaling decisions. They published a blog post with the results, along with the data (900,000 evaluations) and the code so others can reproduce it. (allenai.org)
What is SNR and why does it matter?
SNR means signal-to-noise ratio. Here, signal is how well a benchmark separates models from each other (that is, how much useful spread there is in scores), and noise is the random variability that appears during training and can hide those differences.
If noise is high and signal is low, it’s easy to make the wrong call about which training change is better. Sound familiar? If you’ve run a bunch of small experiments and later seen the results flip when you scale, this matters to you: are you comparing noise or real progress? (allenai.org)
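To make that concrete, here is a minimal sketch of the idea in Python (my own illustration with made-up numbers, not Ai2’s code): noise is estimated from the variability of one model’s last few checkpoints, and signal from the spread between the models being compared.

```python
import statistics

# Made-up example: final scores of three models on one benchmark,
# plus one model's scores over its last four training checkpoints.
final_scores = [0.41, 0.44, 0.52]
last_checkpoints = [0.43, 0.45, 0.44, 0.46]

signal = max(final_scores) - min(final_scores)  # spread between models
noise = statistics.stdev(last_checkpoints)      # checkpoint-to-checkpoint variability
print(f"SNR = {signal / noise:.1f}")
```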
Key findings
- Ai2 shows that SNR predicts whether a benchmark will be useful for deciding between small models or for fitting scaling laws. In their tests, the R² of the prediction was high in several scenarios, suggesting SNR is informative. (allenai.org)
- Not every task that works well at large scale is good for small experiments. Tasks like ARC Easy help at small scale, while HumanEval and MATH 500 are more useful at large scale. (allenai.org)
- Ai2 released a large collection of evaluations over 465 open-weight models and intermediate checkpoints to compute SNR and enable reproducibility. (allenai.org)
Practical interventions that proved to work
Ai2 tests two simple interventions to improve the SNR of an evaluation:
- Filter noisy subtasks: in composite benchmarks (for example MMLU), not every subtask contributes signal. Ordering subtasks by SNR and using only the best ones can increase the signal-to-noise ratio and reduce decision errors by up to 32% for MMLU in their tests. (allenai.org)
- Change the evaluation metric: for generative benchmarks with math or code problems, using bits per byte (BPB) over the human answer raises SNR a lot (for example, GSM8K goes from 1.2 to 7.0 and MBPP from 2.0 to 41.8 in their experiment), and it improves decision consistency when scaling; see the sketch below. (allenai.org)
Together, applying these ideas improved small-scale decision accuracy for most benchmarks and reduced prediction error on many tasks. (allenai.org)
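On the second intervention: bits per byte is the model’s negative log-likelihood of the human-written answer, expressed in bits and normalized by the answer’s length in bytes. Here is a minimal sketch, assuming you can get per-token log-probabilities for the reference answer from your scoring harness (the function name and numbers are illustrative, not Ai2’s implementation).

```python
import math

def bits_per_byte(token_logprobs, reference_text):
    """BPB of a reference answer under a model.

    token_logprobs: natural-log probabilities the model assigns to each token
                    of the reference answer (from whatever scoring harness you use).
    reference_text: the human-written answer string being scored.
    """
    nll_bits = -sum(token_logprobs) / math.log(2)    # total NLL converted to bits
    num_bytes = len(reference_text.encode("utf-8"))  # tokenizer-agnostic normalizer
    return nll_bits / num_bytes

# Example with made-up log-probabilities for a short reference answer.
print(bits_per_byte([-1.2, -0.4, -2.3, -0.9], "The answer is 42."))
```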
What does this mean for you, whether you're a researcher, engineer, or founder?
- If you work with small experiments to pick architectures or datasets, measuring SNR can save you time and compute. Instead of trusting the mean of a large benchmark, prioritize subtasks with high SNR.
- If you’re responsible for evaluation pipelines, consider adding an automatic step that estimates SNR with intermediate checkpoints. That tells you whether the benchmark will give a clear signal or just noise.
- If you’re a founder comparing models or services, ask for metrics that include SNR analysis or show stability across checkpoints. You’ll avoid decisions based on random fluctuations.
How to get started today (practical steps)
- Evaluate several final checkpoints of your training and compute the standard deviation to estimate noise.
- Compute the spread between models (for example, the maximum pairwise difference) to estimate signal, and derive SNR = signal / noise (see the sketch after this list).
- Rank subtasks by SNR and test performance using only the top ones. See if your small-scale choices hold when you scale.
- Try alternative metrics like BPB for generative tasks if you see lots of noise in traditional metrics.
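Putting those steps together, here is a minimal sketch of ranking subtasks by SNR and keeping only the top ones. The data layout and names are my own assumptions for illustration, not Ai2’s released code.

```python
import statistics

def subtask_snr(final_scores_by_model, last_checkpoint_scores):
    """SNR of one subtask: spread across models divided by checkpoint noise."""
    signal = max(final_scores_by_model) - min(final_scores_by_model)
    noise = statistics.stdev(last_checkpoint_scores)
    return signal / noise if noise > 0 else float("inf")

def top_subtasks(results, top_k):
    """Rank subtasks by SNR and return the top_k most reliable ones.

    results maps subtask name -> (final scores of the models being compared,
                                   one model's scores over its last checkpoints).
    """
    ranked = sorted(results, key=lambda name: subtask_snr(*results[name]), reverse=True)
    return ranked[:top_k]

# Made-up scores for three subtasks of a composite benchmark.
results = {
    "subtask_a": ([0.40, 0.48, 0.55], [0.47, 0.48, 0.46, 0.48]),
    "subtask_b": ([0.51, 0.52, 0.53], [0.45, 0.55, 0.60, 0.50]),
    "subtask_c": ([0.30, 0.42, 0.58], [0.41, 0.42, 0.43, 0.42]),
}
print(top_subtasks(results, top_k=2))  # then score models only on the high-SNR subtasks
```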
If you want to reproduce what Ai2 did, they published the data and code in their blog post. [Read the article and download the resources at Ai2](https://allenai.org/blog/signal-noise). (allenai.org)
Final reflection
Evaluation is as important as training. Measuring how much useful signal a benchmark has versus noise isn’t just a statistical curiosity: it’s a practical tool for making better decisions at lower cost. Wouldn’t you rather know when the numbers you see actually matter?