AstaBench updates results and gains industrial adoption
AstaBench publishes a new batch of results after evaluating frontier models on more than 2.4K scientific research problems. What do the numbers show about AI's real ability to do cutting‑edge science, and how useful is it today for you as a researcher or developer?
What is AstaBench?
AstaBench is an open benchmark designed to measure whether AI agents can do scientific research with substance and rigor. It's not just a list of tests: it's an evaluation framework, a set of problems, and a collection of baseline agents anyone can use and extend.
The benchmark evaluates four major categories:
literature search and understanding,
writing and running code,
dataset analysis,
and end-to-end discovery workflows.
All code, tools, and baseline agents are open source. The first release shipped alongside Asta, and the paper was presented as an oral at ICLR 2026. The goal is a shared, reproducible measure of whether AI can do science, not just isolated tasks.
New results: key numbers and technical reading
Models were evaluated using the ReAct agent framework; the runs included Claude Opus 4.7, Opus 4.6, Sonnet 4.6, GPT-5.5, GPT-5.4, and Gemini 3.1 Pro Preview.
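For context, ReAct interleaves model reasoning with tool calls: the agent alternates a written thought, an action (search, run code, read a paper), and the resulting observation until it commits to an answer. Here is a minimal sketch of that loop in Python; `call_model` and `run_tool` are hypothetical stubs, not AstaBench's actual harness.

```python
# Minimal ReAct-style loop, for illustration only: the model alternates
# "thought -> action -> observation" until it emits a final answer.
# call_model() and run_tool() are hypothetical stubs, not AstaBench's harness.

def call_model(transcript: str) -> dict:
    # Stub: a real implementation would send the transcript to an LLM.
    return {"thought": "The task is trivial in this stub.", "final": "stub answer"}

def run_tool(action: str) -> str:
    # Stub: a real implementation would dispatch to search/code/analysis tools.
    return f"(observation for {action!r})"

def react_loop(task: str, max_steps: int = 20) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = call_model(transcript)
        transcript += f"Thought: {step['thought']}\n"
        if step.get("final") is not None:
            return step["final"]  # the agent decided it is done
        observation = run_tool(step["action"])
        transcript += f"Action: {step['action']}\nObservation: {observation}\n"
    return "No answer within the step budget."
```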
Aggregated results (overall score, and average cost per problem when reported):
Claude Opus 4.7: 58.0% (average cost $3.54/problem)
Claude Opus 4.6: 55.3%
Claude Sonnet 4.6: 54.5%
Asta v0 (baseline): 53.0%
GPT-5.5: 52.9% (average cost $1.61/problem)
Gemini 3.1 Pro Preview: 49.6%
GPT-5.4: 46.5%
A few important takeaways:
Top scores improved since the last round, but the benchmark is still far from solved.
Improvements aren't uniform across categories: the biggest gains are in Code & Execution and End-to-End Discovery; Data Analysis and Literature Understanding only improved moderately.
Costs rose noticeably, especially for the higher-performance Claude configurations.
GPT-5.5 raises the ceiling for non-Claude models on component tasks (code and analysis), but it still struggles with the harder end-to-end research workflows.
A technical detail worth noting: among the Claude runs, Opus 4.7 gains points at a steep cost. It improves 2.7 points over Opus 4.6 in the overall score but costs roughly 62% more per problem. In End-to-End Discovery the advantage is 10.2 points, but it takes 54% more steps and 65% more cost. Part of the token increase is explained by a new tokenizer in Opus 4.7 that scales token counts by 1.0–1.35x for the same text.
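To put the tokenizer caveat in perspective, here is a back-of-the-envelope adjustment. The 1.0–1.35x scaling range comes from the results above; the raw token count is an invented example, not a measured figure.

```python
# Back-of-the-envelope: how much of Opus 4.7's token growth could be
# explained by the new tokenizer alone. The 1.0-1.35x scaling range comes
# from the report; the raw token count below is an invented example.

raw_tokens_47 = 1_000_000  # hypothetical tokens billed under Opus 4.7

for scale in (1.0, 1.35):
    equivalent_46 = raw_tokens_47 / scale  # same text counted the old way
    inflation = (1 - equivalent_46 / raw_tokens_47) * 100
    print(f"scale {scale:.2f}: {equivalent_46:,.0f} old-tokenizer tokens "
          f"({inflation:.0f}% of the new count is tokenizer inflation)")
```

At the top of the range, about a quarter of the apparent token growth is an accounting effect rather than extra work by the agent.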
GPT-5.5 offers a different quality-to-cost profile: it's 5.1 points behind Opus 4.7 in the aggregate, but at less than half the cost per problem. In other words, it may be the most efficient choice when budget matters more than peak quality. Still, its performance in End-to-End Discovery shows that mastering the components (code, literature, analysis) doesn't guarantee an agent can reliably complete full research workflows.
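One way to frame that choice is incremental score per dollar, computed from the leaderboard numbers above. The metric itself is our framing for illustration, not something AstaBench reports.

```python
# Score-per-dollar comparison using the reported aggregate numbers.
# "Points per dollar" is our framing, not an official AstaBench metric.

runs = {
    "Claude Opus 4.7": (58.0, 3.54),  # (overall %, avg $/problem)
    "GPT-5.5":         (52.9, 1.61),
}

for name, (score, cost) in runs.items():
    print(f"{name}: {score / cost:.1f} points per dollar")

# Opus 4.7 -> ~16.4 pts/$, GPT-5.5 -> ~32.9 pts/$: the cheaper model
# delivers roughly twice the score per dollar, at 5.1 points lower quality.
```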
In Data Analysis tasks the cost per problem stayed low—between $0.18 and $0.44 in frontier runs; End‑to‑End flows remain the most expensive.
A practical example: on E2E-Bench-Hard (take a research idea all the way to working code and a report, without scaffolding), the best run in the original release fully completed only 3% of tasks end to end. The new models raise that percentage, but the result still shows that the intermediate steps (searching, writing code, analyzing, documenting) can each work in isolation without the full chain closing reliably.
Update to the scoring model and transparency
AstaBench updated the models it uses to score ScholarQA-CS2 and End-to-End Discovery, following vendor-recommended upgrade paths. The new End-to-End Discovery scorer is stricter and penalizes fabricated results and placeholder code more consistently. This keeps comparisons fair on the public leaderboard; historical scores were recalibrated where necessary.
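As a toy illustration of the kind of check a stricter scorer can layer in, the sketch below deducts points for placeholder code. The patterns and penalty weights are invented for illustration; the real End-to-End Discovery scorer is model-based and far more nuanced.

```python
import re

# Toy illustration of penalizing placeholder code in a submission.
# Patterns and weights are invented; AstaBench's actual scorer is
# model-based and far more nuanced than this keyword scan.

PLACEHOLDER_PATTERNS = [
    r"\bTODO\b",
    r"\bFIXME\b",
    r"raise NotImplementedError",
]

def placeholder_penalty(code: str, per_hit: float = 0.1, cap: float = 0.5) -> float:
    hits = sum(len(re.findall(p, code)) for p in PLACEHOLDER_PATTERNS)
    return min(hits * per_hit, cap)  # bounded deduction from the task score

submission = "def train_model():\n    raise NotImplementedError  # TODO: implement\n"
print(placeholder_penalty(submission))  # 0.2: two placeholder signals found
```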
Note on costs: the reported figures are the average cost per problem measured by the benchmark under each agent configuration. They reflect differences in harness, tool usage, and number of model calls, so they are not a direct comparison of API prices between providers.
Industrial adoption: who is integrating AstaBench
AstaBench is already moving out of the lab. Recent adoptions and collaborations include:
UK AI Security Institute (UK AISI) and Arcadia Impact, working to incorporate AstaBench into Inspect Evals so that security researchers and developers can use it more easily.
General Reasoning integrated an AstaBench task (SUPER-Expert) as an environment in OpenReward, their platform for RL environments at scale.
Organizations that have submitted agents or shown interest: Elicit, SciSpace, Distyl AI and EvoScientist.
Its openness and reproducibility make AstaBench a contender to become the de facto standard for evaluating agents' scientific capabilities.
Want to try your agent?
Everything you need is in the AstaBench and agent-baselines repositories. AstaBench accepts external submissions to the leaderboard and is working to make the process easier. If you build agents aimed at scientific research, this is a practical, public way to measure progress and compare approaches.
AstaBench doesn't claim AI can already replace scientists. Rather, it offers clear metrics to see which parts of the process AI does well today, which improve quickly, and where the biggest challenges remain to build agents truly capable of end-to-end research.