While calibrating a run of Terminal-Bench 2.0 on a Google Kubernetes Engine cluster, we noticed something uncomfortable: our scores didn't match the official leaderboard, and up to 6% of tasks failed due to pod errors unrelated to the model's skill. Surprised? So were we. Merely changing how resources were enforced was enough to move results by several percentage points.
What Anthropic found
Agentic coding evals aren't static tests: the model writes code, installs dependencies, runs tests and retries over multiple turns. The execution environment stops being a simple container and becomes part of the test itself.
Anthropic ran Terminal-Bench 2.0 with six resource configurations: from strictly applying the per-task spec (1x) to not imposing limits at all (uncapped). Everything else was the same: same Claude model, same harness, same task set.
The key results:
- Infrastructure error rates fell from 5.8% at 1x to 0.5% with uncapped.
- From 1x to 3x, infra errors dropped from 5.8% to 2.1% (p < 0.001).
- The total jump from 1x to uncapped was +6 percentage points in success (p < 0.01).
- On SWE-bench, increasing RAM up to 5x had a smaller but measurable effect: +1.54 percentage points at 5x vs 1x.
They also observed that beyond roughly 3x, scores climb faster than the drop in infra errors alone can explain. In other words, with more headroom agents don't just stop failing for infrastructure reasons: they start using strategies that only work with generous resources.
Why implementation details matter
You can’t just publish a per-task resource spec and call it a day. In container runtimes there are two different parameters: the guaranteed allocation (reservation) and the hard limit at which the container is killed (kill threshold). If both are equal, there’s no headroom for transient spikes: a small memory fluctuation can cause an OOM-kill, and the task dies before the agent solves anything.
In practice, different sandboxing providers apply these rules differently. The provider used by the Terminal-Bench leaderboard allows temporary overcommit and avoids immediate kills, making execution more stable. Our Kubernetes implementation, by contrast, treated the spec as both floor and ceiling, which produced avoidable failures.
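To make the distinction concrete, here is a minimal sketch using the official Kubernetes Python client; the image name and the 2Gi/6Gi values are illustrative assumptions, not the benchmark's actual spec. The point is simply that the request (the guaranteed floor) and the limit (the kill threshold) are set to different values:

```python
# Minimal sketch (assumed values): a pod whose memory request and limit are
# decoupled, leaving 3x headroom instead of using one number as both floor
# and ceiling. Requires the official `kubernetes` Python client.
from kubernetes import client

task_container = client.V1Container(
    name="task-runner",
    image="example.com/terminal-bench-task:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"memory": "2Gi", "cpu": "1"},  # guaranteed allocation (reservation)
        limits={"memory": "6Gi", "cpu": "3"},    # hard kill threshold: 3x headroom
    ),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="terminal-bench-task"),
    spec=client.V1PodSpec(containers=[task_container], restart_policy="Never"),
)
```

Setting `requests` equal to `limits` reproduces exactly the floor-equals-ceiling behavior described above: any transient spike past the spec gets the container OOM-killed.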
What restrictions change: efficiency versus brute force
Strict limits favor agents that write lean, efficient code. Generous limits favor agents that install large dependencies, spawn costly processes or run heavy test suites. Both styles are valid, but collapsing them into a single number without stating the conditions of the run makes it hard to judge how well either one generalizes.
A concrete example: in the bn-fit-modify task some agents install pandas, networkx and scikit-learn and do fine with headroom. Under strict limits the pod runs out of memory during installation, and the agent doesn’t even reach the first line of code. Other agents implement the math using the standard library and pass with fewer resources. The configuration decides which approach wins.
Other sources of variation
Infrastructure isn’t just memory and CPU. Time limits, concurrency, egress bandwidth, API latency by time of day and the overall health of the cluster can all influence results. Anthropic documented anecdotal fluctuations by time of day, likely due to changes in API latency. In short, agentic evals are end-to-end tests: any part of the system can be a confounder.
A model can lose points to an OOM-kill that reflects not its ability but the details of how the runtime enforces limits.
Practical recommendations (technical)
- Specify two parameters per task: `guaranteed_allocation` and `hard_kill_threshold`. Don't use a single value that is both floor and ceiling (see the sketch after this list).
- Calibrate the margin between them. In Terminal-Bench 2.0, a ceiling of 3x over the per-task spec reduced infra errors from 5.8% to 2.1% (p < 0.001) and kept the score gain within noise (p = 0.40). That kind of empirical tuning is the right way to find a tradeoff between stability and resource pressure.
- Run evals multiple times and on different days. Multiple runs and averaging help smooth out network noise, latency and occasional cluster events.
- Publish both the recommended specs and the enforcement methodology (which runtime, how overcommit is handled and the actual parameters used). Without that, comparisons across leaderboards are fragile.
- Treat resource configuration as an experimental variable, at the same level as the prompt or sampling temperature.
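As a hypothetical illustration of the first two recommendations (the type and field names are ours, not Anthropic's), a published per-task spec can be expanded into the two parameters with a configurable headroom multiplier, defaulting to the 3x ceiling calibrated above:

```python
# Hypothetical helper: derive a two-parameter resource spec from a published
# per-task requirement and a headroom multiplier. The 3.0 default mirrors the
# 3x ceiling that cut infra errors from 5.8% to 2.1% in Terminal-Bench 2.0.
from dataclasses import dataclass

@dataclass
class ResourceSpec:
    guaranteed_allocation_mib: int  # reservation: what the task is promised
    hard_kill_threshold_mib: int    # ceiling: where the runtime OOM-kills

def build_spec(per_task_mib: int, headroom: float = 3.0) -> ResourceSpec:
    """Keep the published spec as the floor and place the kill threshold above it."""
    return ResourceSpec(
        guaranteed_allocation_mib=per_task_mib,
        hard_kill_threshold_mib=int(per_task_mib * headroom),
    )

# A task specced at 2048 MiB gets a 6144 MiB kill threshold.
print(build_spec(2048))
```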
Implications for people who use or maintain benchmarks
If you use scores to decide which model to deploy, keep in mind that a 1–2 percentage point difference can fall within infrastructural noise. Until methodology is standardized and reproducible, differences below roughly 3 points deserve skepticism, especially if execution details aren't published.
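One way to put that skepticism into practice is to bootstrap the gap between two models across repeated runs and check whether the confidence interval excludes zero. This is a sketch with made-up scores, not data from the post:

```python
# Illustrative only (made-up example scores): estimate whether a 1-2 point
# gap between two models exceeds run-to-run infrastructure noise by
# bootstrapping per-run scores gathered on different days.
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05):
    """Percentile CI for the difference in mean score (model A minus model B)."""
    diffs = []
    for _ in range(n_boot):
        a = [random.choice(scores_a) for _ in scores_a]  # resample runs of A
        b = [random.choice(scores_b) for _ in scores_b]  # resample runs of B
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot) - 1]

# Five runs per model on different days (illustrative numbers).
model_a = [62.1, 60.8, 63.0, 61.5, 62.4]
model_b = [60.9, 61.7, 60.2, 62.0, 61.1]
print(bootstrap_diff_ci(model_a, model_b))  # a CI straddling 0 means the gap is noise-level
```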
For eval maintainers: publishing recommended resources is a good first step. Publishing how those resources are applied (guarantee vs limit, sandbox provider, overcommit policies) would close the gap that allows misleading interpretations.
Final reflection
Agentic evals are useful because they test the whole system, but that’s exactly what makes them fragile to infrastructure details. This isn’t magic: a couple of gigabytes of memory or a different overcommit policy can decide who tops a leaderboard.
Do you want to measure the model’s capacity or the power of the VM around it? The answer should be explicit on the benchmark’s spec sheet.
