FrontierScience: OpenAI measures AI on research tasks
OpenAI introduces FrontierScience, a new benchmark designed to evaluate whether AI models can reason at an expert level in physics, chemistry and biology. Can machines really help do science, or do they just repeat what they've seen? This evaluation aims to answer that question with hard problems designed by scientists.
What FrontierScience is and how it works
FrontierScience is a collection of more than 700 questions written and verified by experts across the three main areas of experimental and theoretical science: physics, chemistry and biology. The goal isn't to test memory, but to measure scientific reasoning across two tracks:
Olympiad: 100 olympiad-style short-answer questions, written by international medalists to test precise, mathematical reasoning.
Research: 60 research-style subtasks, more open and complex, designed by PhDs and postdocs, evaluated with a 10-point rubric.
The idea is to cover both closed problems, where the final answer can be verified quickly, and research tasks that require intermediate steps and checks on the reasoning itself.
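As a rough mental model of how such a two-track benchmark could be organized (a sketch only; the field names and layout are assumptions, not OpenAI's published data format), each task carries either a verifiable short answer or a rubric:

```python
from dataclasses import dataclass, field

# Hypothetical representation of a FrontierScience-style task.
# Structure and names are illustrative assumptions.

@dataclass
class RubricItem:
    description: str   # what a correct solution must include
    points: int        # independent items sum to 10 on Research tasks

@dataclass
class Task:
    track: str                       # "olympiad" or "research"
    prompt: str                      # the question or subtask statement
    short_answer: str | None = None  # Olympiad: a number or expression
    rubric: list[RubricItem] = field(default_factory=list)  # Research
```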
How answers are graded
Olympiad-style questions take short answers (a number or an expression) that can be checked exactly or via flexible matching.
Research questions use a rubric with several independent items, totaling 10 points; a solution is considered correct if it scores at least 7/10.
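In code, these two grading modes reduce to simple checks. Here is a minimal sketch, assuming a numeric tolerance for flexible matching; the tolerance value and function names are illustrative, and only the 7/10 threshold comes from the benchmark description:

```python
import math

def grade_olympiad(answer: str, reference: str, rel_tol: float = 1e-4) -> bool:
    """Short-answer check: exact string match, or numeric match within a
    tolerance. The tolerance value is an assumption for illustration."""
    if answer.strip().lower() == reference.strip().lower():
        return True
    try:
        return math.isclose(float(answer), float(reference), rel_tol=rel_tol)
    except ValueError:
        return False  # not numeric and not an exact match

def grade_research(item_scores: list[int], threshold: int = 7) -> bool:
    """Sum independent rubric items (totaling 10 points); a solution
    counts as correct at 7/10 or above."""
    return sum(item_scores) >= threshold
```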
To scale the evaluation, OpenAI uses an automatic grader based on GPT-5 that compares answers against the rubric. It doesn't replace an expert human, but it makes grading a much larger number of responses feasible.
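Such a model-based grader could be approximated with the OpenAI Python SDK. This is a hedged sketch, not OpenAI's actual pipeline: the prompt wording and the model identifier are assumptions.

```python
from openai import OpenAI  # pip install openai; requires an API key

client = OpenAI()

def auto_grade(solution: str, rubric_items: list[str]) -> str:
    """Ask a strong model to score a solution against independent rubric
    items. Prompt and model name are illustrative, not OpenAI's setup."""
    rubric_text = "\n".join(f"- {item}" for item in rubric_items)
    response = client.chat.completions.create(
        model="gpt-5",  # assumed identifier; the post only says "based on GPT-5"
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Score each rubric item "
                        "independently, then report the total out of 10."},
            {"role": "user",
             "content": f"Rubric:\n{rubric_text}\n\nSolution:\n{solution}"},
        ],
    )
    return response.choices[0].message.content
```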
"The benchmark provides a compass to measure expert scientific reasoning, even though it doesn't cover everything a scientist does day to day."
Key results and what they mean
The most advanced models showed notable progress, but with caveats:
GPT-5.2 scored 77% on the Olympiad track and 25% on Research, the best reported performance on this set.
Gemini 3 Pro was very close on Olympiad with 76%.
In previous benchmarks, improvement was fast: for example, on GPQA GPT-4 reached 39% in 2023 and GPT-5.2 hit 92% two years later.
These numbers indicate that AIs can already solve many structured, high-level problems, but there's still a long way to go on the open-ended, creative tasks typical of real research.
What can AI do today in a lab or research project?
Practical experience already shows concrete uses: multidisciplinary literature searches in multiple languages, help walking through complex mathematical proofs, and quick exploration of hypotheses that used to take days or weeks. In some cases, models like GPT-5 have measurably sped up stages of scientific work.
Does this mean AI will replace scientists? No. Models help speed up structured work and explore connections, but scientists remain essential to define problems, validate results and design real-world experiments.
Important limitations
FrontierScience advances evaluation, but it has clear limits:
It focuses on problems with relatively bounded statements; it doesn't fully measure the generation of genuinely new hypotheses.
It doesn't evaluate interaction with complex multimodal data (for example, video or real physical experiments).
Questions were filtered against OpenAI's internal models (tasks those models already solved were discarded), which can bias comparisons with other labs' models.
Using an automatic grader based on models speeds up evaluation, but it's not as objective as human review for long, open-ended tasks.
What this means for the scientific community and for you
If you work in science or on products that use it, FrontierScience is a useful signal: models are already tools capable of accelerating parts of the workflow. But it also reminds us that human-machine collaboration is the safe path today: the AI suggests, the expert validates.
For the general public, it's a demonstration that AI is moving beyond search assistance and into the territory of complex reasoning. Are we ready to trust those suggestions? Not without human verification.
Where this is headed
OpenAI plans to iterate on FrontierScience, expand it to new areas and combine it with more realistic evaluations that show what new discoveries models help enable. In practice, progress in scientific reasoning will come from improving general systems and from focused efforts on scientific capabilities.
The real thermometer isn't a benchmark; it's the new discoveries that AI helps generate and scientists validate. FrontierScience offers a useful compass: it tells us where models excel, where they fail, and what we need to work on so they become reliable research partners.