Evaluating AI Agents in Scientific Discovery | Keryc
Everywhere from social media to academic conferences, AI agents that design experiments, write code, and even draft papers are being announced. But are they really doing science, or just mimicking the outward steps of the scientific process?
What ScienceWorld and DiscoveryWorld measure
Ai2 developed two key benchmarks to answer that question: ScienceWorld (2022) and DiscoveryWorld (2024). Both are simulated textual worlds where an agent must carry out scientific tasks, but they target different levels of complexity.
ScienceWorld recreates elementary-level experiments: measuring boiling points, mixing substances, tinkering with circuits, and Mendelian genetics. Agents interact with objects that behave according to simple physics and chemistry, and must plan and execute steps to obtain a measurement or reproduce a finding.
DiscoveryWorld frames end-to-end research in fictional contexts (Planet X). Here the agent has to generate hypotheses, design experiments, run long routines, and justify results in areas like proteomics, radioisotope dating, or epidemiology.
Both use randomized setups to force generalization: memorizing solutions isn't enough.
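The randomized-setup idea can be sketched as a task generator that re-samples parameters each episode. This is a minimal illustration, not the benchmarks' actual API; the function name, substances, and schema are hypothetical:

```python
import random

def sample_boiling_point_task(seed: int) -> dict:
    """Hypothetical ScienceWorld-style task: parameters are re-sampled
    per episode, so memorizing one solution transcript is not enough."""
    rng = random.Random(seed)
    substance = rng.choice(["water", "ethanol", "acetone"])
    # Ground-truth boiling points (degrees C) the simulator would hold internally.
    boiling_points = {"water": 100.0, "ethanol": 78.4, "acetone": 56.1}
    return {
        "instruction": f"Measure the boiling point of {substance}.",
        "substance": substance,
        "answer": boiling_points[substance],
    }

# Different seeds yield (potentially) different task instances;
# the same seed always reproduces the same instance.
task_a = sample_boiling_point_task(seed=1)
task_b = sample_boiling_point_task(seed=2)
```

An agent evaluated across many seeds has to generalize the procedure (heat, observe, read the thermometer) rather than replay a memorized answer.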
Metrics and why they matter
The benchmarks don't just check if the agent answers correctly. They measure several dimensions so you can judge real capability:
Task completion rate.
Fidelity to the scientific process: did the agent follow a reasonable experimental design?
True understanding versus luck: can it explain why it got that result?
Robustness and generalization under parametric variation.
These metrics separate "test-taking intelligence"—answering closed questions—from "experimental intelligence"—planning and executing long investigations.
As Peter Jansen (Ai2) sums it up: understanding a concept and applying it in an experiment are distinct skills.
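The four dimensions above can be aggregated from per-episode logs roughly as follows. The log schema here is entirely hypothetical, chosen only to make each metric concrete:

```python
from statistics import mean

# Hypothetical episode logs: each records whether the task was completed,
# whether the agent's action trace followed a valid experimental protocol,
# whether its final explanation was judged correct, and which parametric
# variant of the task it faced.
episodes = [
    {"completed": True,  "protocol_ok": True,  "explained": True,  "variant": "A"},
    {"completed": True,  "protocol_ok": False, "explained": False, "variant": "B"},
    {"completed": False, "protocol_ok": False, "explained": False, "variant": "C"},
]

# Task completion rate.
completion_rate = mean(e["completed"] for e in episodes)
# Fidelity to the scientific process.
process_fidelity = mean(e["protocol_ok"] for e in episodes)
# "Understanding vs. luck": completed AND correctly explained.
understanding = mean(e["completed"] and e["explained"] for e in episodes)
# Generalization: completion rate broken down per parametric variant.
per_variant = {v: mean(e["completed"] for e in episodes if e["variant"] == v)
               for v in {e["variant"] for e in episodes}}
```

Note how the second episode inflates the completion rate but not the process or understanding scores: exactly the gap between test-taking and experimental intelligence.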
Results: real progress, but still far
When ScienceWorld appeared, models that passed science exams failed more than 90% of practical tasks. In a few years we saw notable improvements: suites like TALES (Microsoft Research) reported scores in the low 80s on ScienceWorld by early 2025. That shows models are learning to plan and execute longer action sequences.
But DiscoveryWorld still exposes clear limits: on harder tasks, the best agents complete around 20% of the challenges, while trained human scientists solve about 70% on average. That tells you agents haven't mastered open-ended research with real ambiguity.
Technical design that explains the difficulty
A few technical reasons these benchmarks are demanding:
Long-horizon planning: many investigations require hundreds of actions and hierarchical planning.
Exploration vs. exploitation: the agent must balance testing new hypotheses with deepening promising leads.
Process evaluation: a final answer isn't enough; the method must be judged.
Variability and randomness: parameters are reconfigured to prevent overfitting.
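The exploration-versus-exploitation tension can be illustrated with a minimal epsilon-greedy hypothesis selector. This is a toy sketch, not tied to either benchmark; the hypothesis names and evidence counts are invented:

```python
import random

def pick_hypothesis(evidence: dict, epsilon: float, rng: random.Random) -> str:
    """With probability epsilon, explore a hypothesis at random;
    otherwise exploit the one with the most supporting evidence so far."""
    if rng.random() < epsilon:
        return rng.choice(list(evidence))      # explore
    return max(evidence, key=evidence.get)     # exploit

# Toy evidence tallies for three competing hypotheses (hypothetical).
evidence = {"H1: radioisotope age": 3, "H2: contamination": 1, "H3: instrument drift": 0}
choice = pick_hypothesis(evidence, epsilon=0.2, rng=random.Random(0))
```

A real agent faces this trade-off over hundreds of steps, compounding it with the long-horizon planning problem above.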
In evaluation, it's useful to distinguish raw performance metrics (e.g., accuracy) from process metrics (e.g., adherence to experimental protocol), and to add measures of computational cost and sample efficiency when relevant.
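Cost and sample-efficiency measures can be as simple as wrapping the agent loop so every episode reports how many actions and how much wall-clock time it consumed. A minimal sketch, with a hypothetical `agent_step` callable standing in for one agent-environment interaction:

```python
import time

def run_with_budget(agent_step, max_actions: int = 100) -> dict:
    """Run a hypothetical agent loop under an action budget, recording
    the outcome alongside its cost: actions used and elapsed seconds."""
    start = time.perf_counter()
    done = False
    for n_actions in range(1, max_actions + 1):
        done = agent_step()
        if done:
            break
    return {
        "completed": done,
        "actions_used": n_actions,
        "seconds": time.perf_counter() - start,
    }

# Toy agent that finishes on its 5th step.
calls = {"n": 0}
def toy_step() -> bool:
    calls["n"] += 1
    return calls["n"] >= 5

result = run_with_budget(toy_step)
```

Reporting `actions_used` next to the success flag is what lets you compare a slow-but-thorough agent against a fast-but-lucky one.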
Practical implications for researchers and developers
If you work on AI agents for science, what should you do now?
Evaluate in different environments: test both ScienceWorld and DiscoveryWorld to cover basic experimental skills and end-to-end research.
Report process metrics: document not only whether the agent solved the task, but how it did so.
Prioritize generalization: use parametric variations and multiple seeds to measure robustness.
Investigate hierarchical planning and long-term memory: hybrid architectures (LLMs + symbolic planners or RL modules with memory) tend to improve on long tasks.
Measure cost and latency: practical viability depends on price-performance and inference latency.
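The multi-seed robustness recommendation boils down to a short evaluation harness: run the agent over several parametric variants and report the mean together with the spread, not a single number. A sketch with a hypothetical scoring agent:

```python
from statistics import mean, stdev

def evaluate(agent, task_seeds) -> dict:
    """Hypothetical robustness evaluation: score the agent on several
    parametric task variants (seeds) and report mean plus spread."""
    scores = [agent(seed) for seed in task_seeds]
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "per_seed": scores,
    }

# Toy agent whose score varies weakly with the task variant.
def toy_agent(seed: int) -> float:
    return 0.8 - 0.05 * (seed % 3)

report = evaluate(toy_agent, task_seeds=range(6))
```

An agent whose mean looks strong but whose per-seed scores swing wildly has memorized some variants, not mastered the skill.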
What does this tell us about the future of scientific agents?
There is palpable progress: a jump from sub-10% to ~80% in ScienceWorld over three years is significant. But DiscoveryWorld reminds us that understanding and doing science in complex settings is more than producing correct answers. We need agents that can plan, handle uncertainty, and explain their reasoning.
If the goal is to help cure diseases or discover materials, you have to pass these basic simulation tests first. Ai2's open benchmarks provide a playground where promising ideas can become reproducible results.
Final reflection
AI for science is moving from flashy headlines to a measurable field. ScienceWorld and DiscoveryWorld give you tools to tell apart spectacular claims from real capabilities. Want to know if your agent truly does science? Put it to the test where generalization, process, and explanation matter.