PreScience: AI that predicts the course of science | Keryc
PreScience is a new testbed to evaluate whether artificial intelligence can anticipate, step by step, how science advances. Can a model, given the scientific record up to a fixed point, predict everything from who will form teams to how much impact a paper will have one year later? That’s the question this project—developed by AI2 and the University of Chicago with support from the NSF—sets out to answer.
What PreScience is and why it matters
The idea is simple and ambitious: break a scientific advance into four chained decisions that reflect how science is actually done. Instead of judging isolated tasks, PreScience treats the full sequence: team formation, selection of prior work, generation of the contribution, and impact prediction. This lets you simulate the month-by-month evolution of a field.
Why should you care? If we want AI tools that truly help discover new things, we need to test their ability to anticipate the whole process—not just write convincing abstracts or estimate citations.
Dataset design and safeguards against information leakage
The benchmark is built from real arXiv papers across seven AI subcategories (e.g., NLP, ML, computer vision). Some key numbers:
Training coverage up to October 2024 and evaluation on the following year to force real forecasting.
~100,000 target papers published between October 2023 and October 2025.
A larger corpus of over 500,000 papers and about 183,000 unique authors.
To avoid shortcuts and information leaks, PreScience applies several technical safeguards:
Author disambiguation with a method that improves identity clustering quality.
Filtering target papers to those with between 1 and 10 key references, avoiding trivial or impossible outliers.
Explicit temporal alignment of metadata (citations, h-index, histories) so models don’t see information after the target date.
That means evaluations reflect true forecasting ability, not just remembering what already exists.
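The temporal-alignment safeguard is worth making concrete. The sketch below is a hypothetical helper (not PreScience's actual code; field names are illustrative) showing the core idea: any metadata a model sees about an author must be computed only from events dated before the cutoff.

```python
from datetime import date

def snapshot_author_metrics(citation_events, cutoff):
    """Keep only citation events dated strictly before the cutoff, so a
    model never sees post-cutoff information. Illustrative helper only."""
    visible = [e for e in citation_events if e["date"] < cutoff]
    return {
        "citations": len(visible),
        "last_seen": max((e["date"] for e in visible), default=None),
    }

events = [
    {"date": date(2024, 3, 1)},
    {"date": date(2024, 9, 15)},
    {"date": date(2025, 2, 1)},   # after the cutoff: must stay hidden
]
snap = snapshot_author_metrics(events, cutoff=date(2024, 10, 1))
# snap["citations"] == 2: the 2025 event is excluded from the snapshot
```

The same cutoff logic would apply to h-index, coauthorship histories, and any other time-stamped signal.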
The four composable tasks
PreScience decomposes an advance into four tasks that can be evaluated separately or chained in a full simulation:
Collaborator prediction: given an author and the state of the field, who will they work with next? This aims to capture the social and topical dynamics of team formation.
Prior work selection: given a team, which prior works will they cite? This is a ranking/relevance task over the existing literature.
Contribution generation: with the team and references fixed, what title and abstract will the paper produce? This is where language generation and scientific quality matter.
Impact prediction: once the paper exists, how many citations will it get in its first year? This is a regression task about future attention.
The tasks can be chained into a ‘science simulator’ that, month by month, predicts teams, generates papers, and reincorporates them into the corpus.
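The chained simulation can be sketched as a loop over months in which each stage feeds the next and new papers flow back into the corpus. The four stage functions below are stand-in stubs, not PreScience's actual models; only the control flow reflects the benchmark's design.

```python
# Minimal sketch of the chained 'science simulator' loop.
# The four stage functions are illustrative stubs.

def predict_team(author, corpus):          # stage 1: collaborator prediction
    return [author, "coauthor-of-" + author]

def select_prior_work(team, corpus):       # stage 2: prior work selection
    return corpus[-3:]                     # naive: cite the most recent papers

def generate_contribution(team, refs):     # stage 3: title + abstract
    return {"title": f"Work by {team[0]}", "refs": refs}

def predict_impact(paper):                 # stage 4: first-year citations
    return len(paper["refs"])              # placeholder signal

def simulate(months, seed_corpus, authors):
    corpus = list(seed_corpus)
    for _ in range(months):
        for author in authors:
            team = predict_team(author, corpus)
            refs = select_prior_work(team, corpus)
            paper = generate_contribution(team, refs)
            paper["predicted_citations"] = predict_impact(paper)
            corpus.append(paper)           # reincorporate into the corpus
    return corpus

corpus = simulate(12, seed_corpus=[{"title": "p0", "refs": []}], authors=["a1"])
# 13 papers: 1 seed + 12 months x 1 author
```

The key structural point is the last line of the inner loop: generated papers become part of the corpus that later months condition on, which is what lets errors and homogeneity compound over a 12-month run.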
Measuring contribution quality: LACERScore
Comparing abstracts with surface-similarity metrics isn’t enough. That’s why PreScience introduces LACERScore, a calibrated 1–10 score where a language model acts as a judge guided by automatically generated reference examples that anchor each level.
LACERScore is designed to approximate human judgment and, according to the authors, reaches levels close to inter-annotator agreement—outperforming previous metrics like ROUGE or BERTScore on this specific task.
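To make the anchoring idea tangible, here is a rough sketch of how a rubric-anchored judge prompt might be assembled. This is illustrative only; LACERScore's actual prompt, rubric, and anchor-generation procedure are defined by the authors.

```python
def build_judge_prompt(candidate_abstract, anchors):
    """Assemble a rubric-anchored judging prompt (illustrative, not
    LACERScore's real prompt). `anchors` maps score levels to reference
    example abstracts that calibrate what each level means."""
    lines = ["Rate the abstract from 1 to 10 for scientific quality.",
             "Reference examples anchoring each level:"]
    for level in sorted(anchors):
        lines.append(f"[score {level}] {anchors[level]}")
    lines.append(f"Abstract to rate: {candidate_abstract}")
    lines.append("Answer with a single integer.")
    return "\n".join(lines)

prompt = build_judge_prompt(
    "We propose a method for ...",
    anchors={3: "A vague restatement of known work.",
             8: "A precise, novel, well-motivated contribution."},
)
```

The anchors are what distinguish this from a bare LLM-as-judge setup: without concrete examples pinning down each level, scores drift and calibrate poorly across judges.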
Key technical results
Experiments show that even with state-of-the-art models there’s a lot of room for improvement at every stage:
Collaborator prediction: a simple heuristic based on past coauthorship frequency beats more complex ML models. Predicting collaborations between people who have never worked together is still hard.
Prior work selection: the best baseline gets an nDCG of roughly 0.13, indicating that identifying exactly which papers a team will cite is a difficult task.
Contribution generation: large models produce plausible abstracts but remain noticeably distant from the real ones. GPT-5, the best tested, averaged about ~5.6/10 on LACERScore. Interestingly, a simple paraphrase of the real abstract scores much higher, highlighting the gap between generation and what authors actually wrote.
Impact prediction: there are useful predictive signals, but errors are significant, and highly cited papers are the hardest to anticipate accurately.
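For context on the ~0.13 nDCG figure in prior work selection, here is the standard binary-relevance nDCG computation (not PreScience's exact evaluation code): a ranked list is rewarded for placing actually-cited papers near the top, with logarithmic discounting by rank.

```python
import math

def ndcg(ranked, relevant, k=None):
    """Binary-relevance nDCG: `ranked` is the model's ordered candidate
    list, `relevant` the set of papers the team actually cited."""
    ranked = ranked[:k] if k else ranked
    # Discounted cumulative gain: hits are worth 1 / log2(rank + 1)
    dcg = sum(1 / math.log2(i + 2) for i, p in enumerate(ranked) if p in relevant)
    # Ideal DCG: all relevant items ranked first
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), len(ranked))))
    return dcg / ideal if ideal else 0.0

# Only one of the team's three real references retrieved, at rank 4:
score = ndcg(["p9", "p7", "p2", "p1"], relevant={"p1", "p3", "p5"})
# score is roughly 0.20, i.e. in the same low range as the baselines
```

A score near 0.13 means the baselines typically surface only a fraction of the true references, and rarely near the top of the ranking.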
The 12-month simulation: where AI fails in the full sequence
The most revealing test was chaining the four stages into 12-month simulations. Main result: simulated science is systematically less diverse and less novel than real science.
Diagnosed points:
The upstream stages (team formation and literature selection) actually produce more diversity than is observed in real science. In other words, the inputs fed into generation were diverse.
The diversity drop happens at the generation stage. Given a diverse set of inputs, the model tends to produce outputs that are more homogeneous than what humans write.
In short: it’s not that models can’t imagine unusual teams or mix disparate papers; it’s that when producing the concrete contribution, they converge toward similar variants and lose novelty.
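One way to see this kind of convergence is a crude homogeneity proxy: the average pairwise lexical overlap across a set of abstracts. The sketch below (an illustration, not the paper's diversity metric; a real analysis would use embeddings) shows how a set of near-duplicate outputs scores far more homogeneous than a varied one.

```python
from itertools import combinations

def mean_pairwise_jaccard(abstracts):
    """Average Jaccard similarity between word sets of all abstract
    pairs; higher means more homogeneous. Crude lexical proxy only."""
    sets = [set(a.lower().split()) for a in abstracts]
    sims = [len(a & b) / len(a | b) for a, b in combinations(sets, 2)]
    return sum(sims) / len(sims)

varied = ["graph pruning for sparse training",
          "a benchmark for audio captioning",
          "formal verification of smart contracts"]
homogeneous = ["improved transformer for text generation",
               "efficient transformer for text summarization",
               "scalable transformer for text classification"]

h = mean_pairwise_jaccard(varied)       # near zero: little shared vocabulary
m = mean_pairwise_jaccard(homogeneous)  # much higher: pairs share most words
```

Applied month by month to a simulated corpus, a rising score of this kind is the signature of the collapse toward similar variants described above.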
Implications and upcoming technical challenges
PreScience highlights challenges that matter if we want AI to truly amplify scientific discovery:
Predicting first-time collaborations requires models that integrate social, institutional, and network-topology signals beyond historical coauthorship.
Better retrieval and reasoning over the literature are needed to raise nDCG in prior work selection.
Generation needs mechanisms that encourage conceptual diversity and creative risk, not just linguistic fluency.
Predicting impact requires understanding diffusion, networks, and editorial and community signaling factors.
Technically, the authors suggest exploring richer contexts (affiliations, venues, funding) and multimodal artifacts (figures, tables) to improve prediction fidelity.
How you can get involved today
PreScience is a living benchmark: it includes training and test corpora, author mappings, baseline implementations, and evaluation scripts. If you work on AI applied to science or recommendation systems for researchers, this is a valuable resource to test ideas that don’t just optimize a single metric but improve the whole research process.
Are you wondering whether AI will push science toward narrower pathways or bolder exploration? PreScience shows that, for now, the main limitation is creative generation. Improving that isn’t just a technical problem; it means redesigning training and evaluation objectives to value novelty and diversity.