AI Evaluation Becomes the New Compute Bottleneck
Model evaluation is no longer the cheap formality many of us believed it to be. Evaluating complex agents, scientific models or training-in-the-loop protocols can cost as much as training a model, or more, and that changes who can audit, reproduce and validate results.
What's happening?
Can you imagine paying $40,000 just to run a battery of agents? That’s exactly what the Holistic Agent Leaderboard (HAL) reports: about 21,730 rollouts totaling roughly $40,000, and still growing. A single run on GAIA can cost $2,829 before caching. In other studies, a single sweep can cost $22,000 and reveal 33× cost differences across apparently identical tasks.
Why does the price skyrocket? Because we no longer evaluate just the model but the combination of model × scaffold × token budget, and small decisions (how you call the browser, whether you cache, how many agent steps you allow) multiply the expense. Some benchmarks also require training inside the loop: The Well needs hundreds to thousands of H100-hours just to evaluate architectures and run hyperparameter sweeps.
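To make that multiplication concrete, here is a minimal cost-model sketch. Every number in it (prices per million tokens, step counts, cache hit rate) is a made-up placeholder, not a figure from the studies above; the point is only that scaffold verbosity and step budgets multiply directly into the bill.

```python
# Minimal cost model for an agent rollout: cost grows with model price,
# scaffold verbosity (tokens per step), and the number of agent steps.
# All numbers below are hypothetical placeholders for illustration.

def rollout_cost(steps, in_tokens_per_step, out_tokens_per_step,
                 price_in_per_mtok, price_out_per_mtok, cache_hit_rate=0.0):
    """Estimate the dollar cost of one multi-step agent rollout."""
    total_in = steps * in_tokens_per_step * (1 - cache_hit_rate)  # cached input tokens treated as free here
    total_out = steps * out_tokens_per_step                       # outputs are never cached
    return (total_in * price_in_per_mtok + total_out * price_out_per_mtok) / 1e6

# Same task, two scaffolds: a chatty scaffold with long contexts vs a lean one.
chatty = rollout_cost(steps=40, in_tokens_per_step=30_000, out_tokens_per_step=1_500,
                      price_in_per_mtok=3.0, price_out_per_mtok=15.0)
lean = rollout_cost(steps=12, in_tokens_per_step=8_000, out_tokens_per_step=800,
                    price_in_per_mtok=3.0, price_out_per_mtok=15.0, cache_hit_rate=0.5)
print(f"chatty scaffold: ${chatty:.2f}, lean scaffold: ${lean:.2f}")  # roughly $4.50 vs $0.29 per rollout
```

Multiply that per-rollout gap by thousands of rollouts, several models and several benchmarks, and the difference between a careless setup and a careful one is the difference between hundreds and tens of thousands of dollars.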
From static benchmarks to agents and training-in-the-loop
Before agents became common, compression techniques worked well: HELM was expensive, but later work showed you could cut evaluation costs 100×–200× while preserving the relative ordering of models. Methods like Flash-HELM, Item Response Theory, tinyBenchmarks and Anchor Points showed that a small set of anchor examples could preserve rankings.
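As a rough illustration of the anchor idea (a toy sketch, not the exact procedure of any of those methods): evaluate models on a small, difficulty-stratified subset of items and check that the ranking matches the full benchmark.

```python
# Crude illustration of the anchor-item idea: evaluate on a small,
# difficulty-stratified subset and check that model rankings survive.
# The correctness matrix is simulated; real methods fit this from data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_models, n_items, n_anchors = 12, 500, 25

# Simulated correctness matrix (rows = models, columns = items),
# generated from a simple skill-vs-difficulty model.
skill = rng.normal(size=(n_models, 1))
difficulty = rng.normal(size=(1, n_items))
p_correct = 1 / (1 + np.exp(difficulty - skill))
correct = (rng.random((n_models, n_items)) < p_correct).astype(float)

# Anchor selection: sort items by empirical difficulty and take an
# evenly spaced subset, so every difficulty level stays represented.
order = np.argsort(correct.mean(axis=0))
anchor_items = order[np.linspace(0, n_items - 1, n_anchors, dtype=int)]

full_scores = correct.mean(axis=1)                      # accuracy on all 500 items
anchor_scores = correct[:, anchor_items].mean(axis=1)   # accuracy on the 25 anchors
rho, _ = spearmanr(full_scores, anchor_scores)
print(f"Spearman rank correlation with {n_anchors}/{n_items} items: {rho:.2f}")
```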
With agents, that breaks down. Rollouts are multi-turn, noisy and scaffold-sensitive. Mid-difficulty filtering can save 2×–3.5× while keeping fidelity, but that’s far from the 100× achievable on static benchmarks. And when a benchmark demands training models (The Well, PaperBench, MLE-Bench), compression almost disappears: each evaluation may require tens or hundreds of full training runs.
Numbers that hurt (relevant examples)
HAL: ~$40,000 for ~21,730 rollouts (9 models × 9 benchmarks). With k = 8 reruns per cell, cost jumps to ~$320,000.
GAIA: one run on a frontier model can cost ~$2,829.
The Well: 960 H100-hours to evaluate a new architecture (about $2,400 at conservative GPU-hour rates); the full sweep: 3,840 H100-hours (about $9,600).
PaperBench: a full run is around $9,500; variants without execution drop to ~$4,200.
MLE-Bench: one seed can cost ~$5,500; multi-seed and multi-model sweeps reach six figures.
There are also enormous disparities in per-token prices among commercial LLMs: input/output fees can vary by two orders of magnitude, so the same experiment can cost 10× or 100× depending on provider and setup.
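The conversions above are plain arithmetic. The quick check below assumes the roughly $2.50 per H100-hour rate implied by those figures and a flat 8× multiplier for reruns, plus two hypothetical per-token prices to show how a two-orders-of-magnitude pricing gap lands on the bill.

```python
# Quick arithmetic behind the figures above. The $2.50/H100-hour rate is
# the conversion implied by the article's numbers, not a quoted price.
H100_HOUR_USD = 2.50

well_single_arch = 960 * H100_HOUR_USD    # -> 2,400 USD for one new architecture
well_full_sweep = 3_840 * H100_HOUR_USD   # -> 9,600 USD for the full sweep
hal_single_pass = 40_000                  # ~21,730 rollouts, 9 models x 9 benchmarks
hal_with_reruns = hal_single_pass * 8     # k = 8 reruns per cell -> ~320,000 USD
print(well_single_arch, well_full_sweep, hal_with_reruns)

# Same experiment, different providers: with hypothetical input prices of
# $0.15 vs $15 per million tokens, an identical token budget differs 100x.
tokens = 2_000_000_000  # hypothetical total input tokens for a full sweep
print(tokens / 1e6 * 0.15, tokens / 1e6 * 15.0)  # -> 300 USD vs 30,000 USD
```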
Reliability: the hidden multiplier
Do you trust a single run? You shouldn’t. Consistency matters, and it costs money. Experiments show severe drops in measured performance when moving from 1 to 8 repetitions (from 60% to 25% in some cases). Adding seeds, reruns or holdout protocols multiplies the budget: what looked doable at $40k becomes $320k for a statistically credible evaluation.
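In practice, "paying for reliability" looks roughly like this: rerun each benchmark cell k times and report an interval instead of a point estimate. A minimal sketch, with made-up rerun outcomes:

```python
# Report per-cell success with a bootstrap confidence interval over k reruns
# instead of a single-run point estimate. The rerun outcomes are made up.
import numpy as np

rng = np.random.default_rng(0)

def summarize(successes, n_boot=10_000):
    """Mean success rate and a 95% bootstrap CI across k reruns of one cell."""
    runs = np.asarray(successes, dtype=float)
    boots = rng.choice(runs, size=(n_boot, runs.size)).mean(axis=1)
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return runs.mean(), (lo, hi)

# One benchmark cell (model x benchmark x scaffold), k = 8 reruns:
# a single lucky run would have reported 1.0; eight runs tell another story.
reruns = [1, 0, 0, 1, 0, 1, 0, 0]
mean, (lo, hi) = summarize(reruns)
print(f"success rate: {mean:.2f} (95% CI {lo:.2f} to {hi:.2f}) over k={len(reruns)} reruns")
```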
The practical consequence is clear: many groups can’t rigorously audit frontier agents.
Implications for the community and governance
Access and accountability: the economic barrier concentrates evaluation capacity in well-funded labs, reducing external, independent validation.
Leaderboards that ignore cost incentivize token waste: if only accuracy counts, why not keep spending until you gain a point? Pareto frontiers over accuracy vs cost fix this (a minimal sketch follows this list), but they aren’t yet the norm.
Duplicated work: the community pays repeatedly for the same baselines because outputs aren’t shared at a granular level (traces, logs, seeds, scaffold configuration).
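A cost-aware leaderboard doesn’t require anything exotic. Here is a minimal sketch of extracting the accuracy-vs-cost Pareto frontier; the entries themselves are fabricated for illustration.

```python
# Extract the accuracy-vs-cost Pareto frontier from leaderboard entries:
# keep an entry only if no other entry is both cheaper and more accurate.
# The entries below are fabricated for illustration.
entries = [
    {"name": "agent-A", "accuracy": 0.62, "cost_usd": 310.0},
    {"name": "agent-B", "accuracy": 0.58, "cost_usd": 45.0},
    {"name": "agent-C", "accuracy": 0.61, "cost_usd": 520.0},  # dominated by agent-A
    {"name": "agent-D", "accuracy": 0.40, "cost_usd": 12.0},
]

def pareto_frontier(entries):
    """Return entries not dominated on (higher accuracy, lower cost)."""
    frontier = []
    for e in sorted(entries, key=lambda e: e["cost_usd"]):
        if not frontier or e["accuracy"] > frontier[-1]["accuracy"]:
            frontier.append(e)
    return frontier

for e in pareto_frontier(entries):
    print(f'{e["name"]}: {e["accuracy"]:.2f} accuracy at ${e["cost_usd"]:.0f}')
# agent-D, agent-B and agent-A survive; agent-C is dominated.
```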
What can researchers do now? (technical and operational recommendations)
Publish traces and artifacts: export rollout logs, tool-call traces and grading traces in a shared schema so others can reuse results.
Use coarse-to-fine strategies: run cheap evaluations first (Flash-HELM style) and reserve heavy compute for top candidates.
Adopt Pareto-efficient leaderboards: report accuracy alongside cost and token-budget per cell.
Mid-difficulty filtering and anchor items: cut the number of items without losing relative order, where possible.
Cache and memoize: avoid paying for repeated I/O and tokens by caching prompts, responses and intermediate steps (see the sketch after this list).
Tabular precomputation where possible (historical example: NAS-Bench-101), so repeated operations are cheap.
Budget limits and standardized multi-seed protocols: define a minimum k for reliability and report confidence intervals.
Standardized result formats: even 2× reuse of published results can save more than many compression techniques combined.
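For the caching point, a minimal memoization layer keyed on the full request is often enough. In the sketch below, call_model is a placeholder for whatever client you actually use, not a real API.

```python
# Minimal on-disk cache for LLM calls: identical (model, prompt, params)
# requests are paid for once. `call_model` is a placeholder for your client.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_call(model: str, prompt: str, params: dict, call_model) -> str:
    """Return a cached response if this exact request was made before."""
    key_material = json.dumps({"model": model, "prompt": prompt, "params": params},
                              sort_keys=True)
    key = hashlib.sha256(key_material.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():                                             # cache hit: no tokens billed
        return json.loads(path.read_text())["response"]
    response = call_model(model=model, prompt=prompt, **params)   # cache miss: pay once
    path.write_text(json.dumps({"response": response}))
    return response
```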
Technically, what’s still missing? (useful research lines)
Compression methods that survive sequential noise and scaffold sensitivity in agents.
Evaluation protocols that treat cost as a first-class metric, e.g. accuracy@cost or computational-efficiency curves (a sketch follows this list).
Tools to share and verify traces efficiently and privately (e.g., artifact storage with hashes and signatures).
Inference-scaling studies: when do extra tokens actually improve solution quality and when do they only raise the bill?
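As one possible shape for "cost as a first-class metric", here is a sketch of an accuracy@cost curve: the best accuracy any configuration reaches within a given budget. The configurations are fabricated for illustration.

```python
# accuracy@cost: for each budget, the best accuracy achievable by any
# configuration whose evaluation cost fits under that budget.
# Configurations below are fabricated for illustration.
configs = [(12.0, 0.40), (45.0, 0.58), (310.0, 0.62), (520.0, 0.61)]  # (cost_usd, accuracy)

def accuracy_at_cost(configs, budgets):
    """Best achievable accuracy under each cost budget (None if unaffordable)."""
    curve = []
    for budget in budgets:
        feasible = [acc for cost, acc in configs if cost <= budget]
        curve.append((budget, max(feasible) if feasible else None))
    return curve

for budget, acc in accuracy_at_cost(configs, budgets=[10, 50, 100, 400, 1000]):
    print(f"budget ${budget}: best accuracy = {acc}")
```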
Technical research can help, but many of the solutions require changes in practice: publish more, measure cost, and build infrastructure for sharing results.
Conclusion
Evaluation has stopped being the cheap sidekick of development and has become a factor that decides who can validate powerful AI. Do we want the people who build models to be the only ones able to evaluate them? If the answer is no, the community must change its practices: measure and report cost, share traces, and adopt reproducible, cost-aware protocols.