AI Evaluation Becomes the New Compute Bottleneck
Model evaluation is no longer the cheap formality many of us believed it to be. Evaluating complex agents, scientific models or training-in-the-loop protocols can cost as much as training a model, or more, and that changes who can audit, reproduce and validate results.
What's happening?
Can you imagine paying $40,000 just to run a battery of agents? That’s exactly what the Holistic Agent Leaderboard (HAL) reports: about 21,730 rollouts totaling roughly $40,000, and still growing. A single run on GAIA can cost $2,829 before caching. In other studies, a single sweep can cost $22,000 and reveal 33× cost differences across apparently identical tasks.
Why does the price skyrocket? Because we no longer evaluate just the model but the combination of model × scaffold × token budget, and small decisions (how you call the browser, whether you cache, how many agent steps you allow) multiply the expense. Some benchmarks also require training inside the loop: The Well needs hundreds to thousands of H100-hours just to evaluate architectures and run hyperparameter sweeps.
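To make that multiplication concrete, here is a minimal cost-model sketch. Every number in it (prices per million tokens, step counts, cache hit rate) is a made-up placeholder, not a figure from the studies above; the point is only that scaffold verbosity and step budgets multiply directly into the bill.

```python
# Minimal cost model for an agent rollout: cost grows with model price,
# scaffold verbosity (tokens per step), and the number of agent steps.
# All numbers below are hypothetical placeholders for illustration.

def rollout_cost(steps, in_tokens_per_step, out_tokens_per_step,
                 price_in_per_mtok, price_out_per_mtok, cache_hit_rate=0.0):
    """Estimate the dollar cost of one multi-step agent rollout."""
    total_in = steps * in_tokens_per_step * (1 - cache_hit_rate)  # cached input tokens treated as free here
    total_out = steps * out_tokens_per_step                       # outputs are never cached
    return (total_in * price_in_per_mtok + total_out * price_out_per_mtok) / 1e6

# Same task, two scaffolds: a chatty scaffold with long contexts vs a lean one.
chatty = rollout_cost(steps=40, in_tokens_per_step=30_000, out_tokens_per_step=1_500,
                      price_in_per_mtok=3.0, price_out_per_mtok=15.0)
lean = rollout_cost(steps=12, in_tokens_per_step=8_000, out_tokens_per_step=800,
                    price_in_per_mtok=3.0, price_out_per_mtok=15.0, cache_hit_rate=0.5)
print(f"chatty scaffold: ${chatty:.2f}, lean scaffold: ${lean:.2f}")  # roughly $4.50 vs $0.29 per rollout
```

Multiply that per-rollout gap by thousands of rollouts, several models and several benchmarks, and the difference between a careless setup and a careful one is the difference between hundreds and tens of thousands of dollars.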
From static benchmarks to agents and training-in-the-loop
Before agents became common, compression techniques worked well: HELM was expensive, but later work showed you could cut evaluation costs 100×–200× while preserving the relative ordering of models. Methods like Flash-HELM, Item Response Theory, tinyBenchmarks and Anchor Points showed that a small set of anchor examples could preserve rankings.
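As a rough illustration of the anchor idea (a toy sketch, not the exact procedure of any of those methods): evaluate models on a small, difficulty-stratified subset of items and check that the ranking matches the full benchmark.

```python
# Crude illustration of the anchor-item idea: evaluate on a small,
# difficulty-stratified subset and check that model rankings survive.
# The correctness matrix is simulated; real methods fit this from data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_models, n_items, n_anchors = 12, 500, 25

# Simulated correctness matrix (rows = models, columns = items),
# generated from a simple skill-vs-difficulty model.
skill = rng.normal(size=(n_models, 1))
difficulty = rng.normal(size=(1, n_items))
p_correct = 1 / (1 + np.exp(difficulty - skill))
correct = (rng.random((n_models, n_items)) < p_correct).astype(float)

# Anchor selection: sort items by empirical difficulty and take an
# evenly spaced subset, so every difficulty level stays represented.
order = np.argsort(correct.mean(axis=0))
anchor_items = order[np.linspace(0, n_items - 1, n_anchors, dtype=int)]

full_scores = correct.mean(axis=1)                      # accuracy on all 500 items
anchor_scores = correct[:, anchor_items].mean(axis=1)   # accuracy on the 25 anchors
rho, _ = spearmanr(full_scores, anchor_scores)
print(f"Spearman rank correlation with {n_anchors}/{n_items} items: {rho:.2f}")
```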
With agents, that breaks down. Rollouts are multi-turn, noisy and scaffold-sensitive. Mid-difficulty filtering can save 2×–3.5× while keeping fidelity, but that’s far from the 100× achievable on static benchmarks. And when a benchmark demands training models (The Well, PaperBench, MLE-Bench), compression almost disappears: each evaluation may require tens or hundreds of full training runs.
Numbers that hurt (relevant examples)
HAL: ~$40,000 for ~21,730 rollouts (9 models × 9 benchmarks). With k = 8 reruns per cell, cost jumps to ~$320,000.
GAIA: one run on a frontier model can cost ~$2,829.
The Well: 960 H100-hours to evaluate a new architecture (about $2,400 at conservative GPU-hour rates); the full sweep: 3,840 H100-hours (about $9,600).
PaperBench: a full run is around $9,500; variants without execution drop to ~$4,200.
MLE-Bench: one seed can cost ~$5,500; multi-seed and multi-model sweeps reach six figures.
There are also enormous disparities in per-token prices among commercial LLMs: input/output fees can vary by two orders of magnitude, so the same experiment can cost 10× or 100× depending on provider and setup.
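The conversions above are plain arithmetic. The quick check below assumes the roughly $2.50 per H100-hour rate implied by those figures and a flat 8× multiplier for reruns, plus two hypothetical per-token prices to show how a two-orders-of-magnitude pricing gap lands on the bill.

```python
# Quick arithmetic behind the figures above. The $2.50/H100-hour rate is
# the conversion implied by the article's numbers, not a quoted price.
H100_HOUR_USD = 2.50

well_single_arch = 960 * H100_HOUR_USD    # -> 2,400 USD for one new architecture
well_full_sweep = 3_840 * H100_HOUR_USD   # -> 9,600 USD for the full sweep
hal_single_pass = 40_000                  # ~21,730 rollouts, 9 models x 9 benchmarks
hal_with_reruns = hal_single_pass * 8     # k = 8 reruns per cell -> ~320,000 USD
print(well_single_arch, well_full_sweep, hal_with_reruns)

# Same experiment, different providers: with hypothetical input prices of
# $0.15 vs $15 per million tokens, an identical token budget differs 100x.
tokens = 2_000_000_000  # hypothetical total input tokens for a full sweep
print(tokens / 1e6 * 0.15, tokens / 1e6 * 15.0)  # -> 300 USD vs 30,000 USD
```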
Reliability: the hidden multiplier
Do you trust a single run? You shouldn’t. Consistency matters, and it costs money. Experiments show severe drops in measured performance when moving from 1 to 8 repetitions (from 60% to 25% in some cases). Adding seeds, reruns or holdout protocols multiplies the budget: what looked doable at $40k becomes $320k for a statistically credible evaluation.
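In practice, "paying for reliability" looks roughly like this: rerun each benchmark cell k times and report an interval instead of a point estimate. A minimal sketch, with made-up rerun outcomes:

```python
# Report per-cell success with a bootstrap confidence interval over k reruns
# instead of a single-run point estimate. The rerun outcomes are made up.
import numpy as np

rng = np.random.default_rng(0)

def summarize(successes, n_boot=10_000):
    """Mean success rate and a 95% bootstrap CI across k reruns of one cell."""
    runs = np.asarray(successes, dtype=float)
    boots = rng.choice(runs, size=(n_boot, runs.size)).mean(axis=1)
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return runs.mean(), (lo, hi)

# One benchmark cell (model x benchmark x scaffold), k = 8 reruns:
# a single lucky run would have reported 1.0; eight runs tell another story.
reruns = [1, 0, 0, 1, 0, 1, 0, 0]
mean, (lo, hi) = summarize(reruns)
print(f"success rate: {mean:.2f} (95% CI {lo:.2f} to {hi:.2f}) over k={len(reruns)} reruns")
```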
The practical consequence is clear: many groups can’t rigorously audit frontier agents.
Implications for the community and governance
Access and accountability: the economic barrier concentrates evaluation capacity in well-funded labs, reducing external, independent validation.
Leaderboards that ignore cost incentivize token waste: if only accuracy counts, why not keep spending until you gain a point? Pareto frontiers over accuracy vs cost fix this (a minimal sketch follows this list), but they aren’t yet the norm.
Duplicated work: the community pays repeatedly for the same baselines because outputs aren’t shared at a granular level (traces, logs, seeds, scaffold configuration).
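A cost-aware leaderboard doesn’t require anything exotic. Here is a minimal sketch of extracting the accuracy-vs-cost Pareto frontier; the entries themselves are fabricated for illustration.

```python
# Extract the accuracy-vs-cost Pareto frontier from leaderboard entries:
# keep an entry only if no other entry is both cheaper and more accurate.
# The entries below are fabricated for illustration.
entries = [
    {"name": "agent-A", "accuracy": 0.62, "cost_usd": 310.0},
    {"name": "agent-B", "accuracy": 0.58, "cost_usd": 45.0},
    {"name": "agent-C", "accuracy": 0.61, "cost_usd": 520.0},  # dominated by agent-A
    {"name": "agent-D", "accuracy": 0.40, "cost_usd": 12.0},
]

def pareto_frontier(entries):
    """Return entries not dominated on (higher accuracy, lower cost)."""
    frontier = []
    for e in sorted(entries, key=lambda e: e["cost_usd"]):
        if not frontier or e["accuracy"] > frontier[-1]["accuracy"]:
            frontier.append(e)
    return frontier

for e in pareto_frontier(entries):
    print(f'{e["name"]}: {e["accuracy"]:.2f} accuracy at ${e["cost_usd"]:.0f}')
# agent-D, agent-B and agent-A survive; agent-C is dominated.
```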
What can researchers do now? (technical and operational recommendations)
Publish traces and artifacts: export rollout logs, tool-call traces and grading traces in a shared schema so others can reuse results.
Use coarse-to-fine strategies: run cheap evaluations first (Flash-HELM style) and reserve heavy compute for top candidates.
Adopt Pareto-efficient leaderboards: report accuracy alongside cost and token-budget per cell.
Mid-difficulty filtering and anchor items: cut the number of items without losing relative order, where possible.
Cache and memoize: avoid paying for repeated I/O and tokens by caching prompts, responses and intermediate steps (see the sketch after this list).
Tabular precomputation where possible (historical example: NAS-Bench-101), so repeated operations are cheap.
Budget limits and standardized multi-seed protocols: define a minimum k for reliability and report confidence intervals.
Standardized result formats: even 2× reuse of published results can save more than many compression techniques combined.
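For the caching point, a minimal memoization layer keyed on the full request is often enough. In the sketch below, call_model is a placeholder for whatever client you actually use, not a real API.

```python
# Minimal on-disk cache for LLM calls: identical (model, prompt, params)
# requests are paid for once. `call_model` is a placeholder for your client.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_call(model: str, prompt: str, params: dict, call_model) -> str:
    """Return a cached response if this exact request was made before."""
    key_material = json.dumps({"model": model, "prompt": prompt, "params": params},
                              sort_keys=True)
    key = hashlib.sha256(key_material.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():                                             # cache hit: no tokens billed
        return json.loads(path.read_text())["response"]
    response = call_model(model=model, prompt=prompt, **params)   # cache miss: pay once
    path.write_text(json.dumps({"response": response}))
    return response
```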
Technically, what’s still missing? (useful research lines)
Compression methods that survive sequential noise and scaffold sensitivity in agents.
Evaluation protocols that treat cost as a first-class metric, e.g. accuracy@cost or computational-efficiency curves (a sketch follows this list).
Tools to share and verify traces efficiently and privately (e.g., artifact storage with hashes and signatures).
Inference-scaling studies: when do extra tokens actually improve solution quality and when do they only raise the bill?
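As one possible shape for "cost as a first-class metric", here is a sketch of an accuracy@cost curve: the best accuracy any configuration reaches within a given budget. The configurations are fabricated for illustration.

```python
# accuracy@cost: for each budget, the best accuracy achievable by any
# configuration whose evaluation cost fits under that budget.
# Configurations below are fabricated for illustration.
configs = [(12.0, 0.40), (45.0, 0.58), (310.0, 0.62), (520.0, 0.61)]  # (cost_usd, accuracy)

def accuracy_at_cost(configs, budgets):
    """Best achievable accuracy under each cost budget (None if unaffordable)."""
    curve = []
    for budget in budgets:
        feasible = [acc for cost, acc in configs if cost <= budget]
        curve.append((budget, max(feasible) if feasible else None))
    return curve

for budget, acc in accuracy_at_cost(configs, budgets=[10, 50, 100, 400, 1000]):
    print(f"budget ${budget}: best accuracy = {acc}")
```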
Technical research can help, but many of the solutions require changes in practice: publish more, measure cost, and build infrastructure for sharing results.
Conclusion
Evaluation has stopped being the cheap sidekick of development and has become a factor that decides who can validate powerful AI. Do we want the people who build models to be the only ones able to evaluate them? If the answer is no, the community must change its practices: measure and report cost, share traces, and adopt reproducible, cost-aware protocols.