NeuroDiscoveryBench: AI benchmark for neural data
Neuroscience today generates volumes of data that outpace traditional analysis tools. NeuroDiscoveryBench is presented as the first benchmark designed to measure how well AI systems can answer questions that require direct analysis of open neural data, from single-cell atlases to connectivity maps.
What is NeuroDiscoveryBench and why it matters
NeuroDiscoveryBench is a test suite created by AI2 in collaboration with the Allen Institute. Its goal isn't to evaluate rote memorization, but the ability of AI systems to produce answers grounded in actual analysis of the data. That sets it apart from other neuroscience benchmarks, which target different skills; here, the questions require quantitative observations or scientific hypotheses obtained by processing the data itself.
You might ask: what kind of questions? Some of the included examples are analyzing the relationship between the APOE4 allele and Alzheimer’s pathology scores (ADNC), computing the distribution of donor genotypes for the Glut neurotransmitter class, or identifying the most frequent glutamatergic subclass in the RHP region. These answers can't be obtained with a web search or from prior memory: they depend on the dataset provided.
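To make the flavor of these questions concrete, here is a minimal pandas sketch of how the third example might be answered. The file name and column names are assumptions for illustration, not the actual schema of the Allen Institute release.

```python
import pandas as pd

# Hypothetical cell-metadata table; file name and column names are
# illustrative assumptions, not the real schema of the released data.
cells = pd.read_csv("cell_metadata.csv")

# Keep glutamatergic cells in the region of interest, then find the
# subclass label that appears most often.
glut_rhp = cells[
    (cells["neurotransmitter_class"] == "Glut") & (cells["region"] == "RHP")
]
most_frequent_subclass = glut_rhp["subclass"].value_counts().idxmax()
print(most_frequent_subclass)
```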
How it was built and what it contains
The benchmark comprises approximately 70 question-answer pairs based on three recent Allen Institute publications; each question is tied to the dataset that makes it solvable.
Relevant questions were identified, the analytical workflows they require were reconstructed, and their executability was verified with Asta DataVoyager, an interactive data-analysis tool.
For complex queries, there are "raw" and "processed" versions of the data: the former requires transformations from the original state, while the latter uses preprocessed tables to simplify the analysis (illustrated in the sketch below).
There is also a small subset of questions called "no-traces" that require, beyond analysis, a deeper biological understanding.
Neuroscience and data experts carefully reviewed the wording of the questions and the gold answers to ensure they were clear, unambiguous, and faithful to what the data actually support.
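To illustrate the raw-versus-processed distinction, here is a hedged pandas sketch: with the processed release the answer is a direct aggregation, while the raw release forces the analyst (or agent) to join tables and normalize labels first. All file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical files and columns, used only to illustrate the distinction.
cells_raw = pd.read_csv("cells_raw.csv")    # one row per cell
donors_raw = pd.read_csv("donors_raw.csv")  # one row per donor, free-form labels

# Raw path: normalize labels and join tables before any counting is possible.
donors_raw["genotype"] = donors_raw["genotype"].str.strip().str.upper()
cells = cells_raw.merge(donors_raw, on="donor_id", how="left")

# Processed path: the same question becomes a one-line aggregation.
genotype_distribution = (
    cells[cells["neurotransmitter_class"] == "Glut"]
    .groupby("genotype")["donor_id"]
    .nunique()
)
print(genotype_distribution)
```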
How a system is evaluated
The evaluation gives the system the question and the data, and expects a textual answer or, when the question calls for it, a figure. For textual answers, the scoring function checks whether the context, variables, and relationships match the gold answer. For figures, a vision-language model verifies visual correctness.
This pipeline demands several combined capabilities: natural language understanding, data manipulation, code generation and execution, scientific reasoning, and common sense. In other words, it's multimodal and multifaceted.
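A minimal sketch of that routing logic might look like the following; the rubric text and the judge callables are placeholders standing in for the benchmark's actual scorer and models, not its real implementation.

```python
# Placeholder scorer: text answers go to an LLM judge, figures to a
# vision-language model. Hypothetical interfaces, not the benchmark's code.

TEXT_RUBRIC = (
    "Compare the candidate answer with the gold answer. "
    "Return 1 only if the context, variables, and relationships all match, else 0."
)

def score_answer(question, prediction, gold, llm_judge, vlm_judge):
    """llm_judge and vlm_judge are caller-supplied model wrappers (assumed)."""
    if isinstance(prediction, bytes):  # convention here: figures arrive as image bytes
        return vlm_judge(image=prediction, question=question, gold_description=gold)
    prompt = (
        f"{TEXT_RUBRIC}\n\n"
        f"Question: {question}\nGold answer: {gold}\nCandidate answer: {prediction}"
    )
    return llm_judge(prompt)
```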
Baselines and initial results
Three approaches were evaluated, each running autonomously (a minimal sketch of the configurations follows the list):
No data: the question is given to a language model without the dataset, to check whether the model can answer from memory or inference alone.
No data with search: same as above, but allowing web searches.
DataVoyager: the tool interprets the query, generates transformations and code, executes the analysis on the dataset, and presents the answer.
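The three configurations boil down to two switches: whether the dataset is provided and whether web search is allowed. The runner below is a hypothetical harness for illustration, not AI2's evaluation code.

```python
# Hypothetical baseline configurations mirroring the descriptions above.
BASELINES = {
    "no_data":             {"use_dataset": False, "web_search": False},
    "no_data_with_search": {"use_dataset": False, "web_search": True},
    "datavoyager":         {"use_dataset": True,  "web_search": False},
}

def run_baseline(name, questions, answer_fn, dataset=None):
    """answer_fn(question, dataset, web_search) is an assumed model/agent wrapper."""
    cfg = BASELINES[name]
    data = dataset if cfg["use_dataset"] else None
    return [answer_fn(q, dataset=data, web_search=cfg["web_search"]) for q in questions]
```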
Key results:
The no-data baselines scored low: 6% for "no data" and 8% for "no data with search" (evaluated with GPT-5.1, medium reasoning). That confirms most questions are not solvable without the dataset.
DataVoyager (also with GPT-5.1, medium reasoning, no web search) reached 35%. That's a notable improvement, but it shows the task remains hard.
Two observations stand out: working with datasets in their raw form proved much harder, with agents failing on complex transformations; and in some cases web search actually hurt performance by surfacing irrelevant papers that confused the model.
Technical lessons and opportunities
Preprocessing matters as much as the model. Tools that automate wrangling and cleaning will have an advantage with complex biological datasets.
Real integration of capabilities is needed: generating code, executing it, and reasoning about the results is the flow that produces valid answers (a minimal loop is sketched after this list).
Evaluating figures requires robust vision-language models to compare visualizations with the scientific expectation.
The "no-traces" questions show that collaboration between human experts and agents remains crucial for tasks involving deep biological knowledge.
If you work on building data-analysis agents, this tells you where to start: improve robustness in transformations, enrich semantic understanding of biological variables, and refine evaluation of multimodal outputs.
What’s next and why you might care
NeuroDiscoveryBench will soon become part of AstaBench, AI2's suite of benchmarks for scientific tasks. It's a shared reference point: it lets you compare tools, measure progress, and focus effort where AI still stumbles.
Does this mean AI replaces the scientist? No. It means AI can become a valuable assistant to speed up repetitive analyses and explore hypotheses, leaving researchers time for experimental design, interpretation, and lab work.
If you develop tools or do research in neuroscience, NeuroDiscoveryBench offers a practical, reproducible testbed to measure how much analytical load can be delegated to agents today, and which engineering and scientific problems remain to be solved.