Theorizer creates scientific theories from thousands of papers
Imagine getting structured hypotheses about an entire research area in minutes instead of spending months reading papers. Sound like science fiction? Theorizer makes it an experimental reality: a tool that synthesizes scientific laws from academic literature using a multi-LLM pipeline.
What Theorizer does and why it matters
Theorizer doesn't just summarize articles. Its goal is to identify regularities that recur across studies and turn them into testable statements. Each output is organized as a (LAW, SCOPE, EVIDENCE) tuple:
LAW: the qualitative or quantitative claim (for example, 'X increases Y' or a numeric interval).
SCOPE: conditions and limits where the law applies (e.g., 'only for small R', or 'not valid if P is present').
EVIDENCE: the empirical support extracted and traced back to specific papers.
Theorizer also generates a name and a high-level description to place the theory in the literature landscape. Think of this as compressing hundreds of findings into a law with clear conditions—similar to how Kepler's laws condensed centuries of observations.
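To make the structure concrete, here is a minimal sketch of how such a record could be represented; the field names and example values are illustrative assumptions, not Theorizer's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class Theory:
    # Illustrative structure; field names are assumptions, not Theorizer's schema.
    name: str             # short label for the theory
    description: str      # high-level placement in the literature
    law: str              # qualitative or quantitative claim
    scope: str            # conditions and limits where the law applies
    evidence: list[dict] = field(default_factory=list)  # per-paper support, traced to sources

example = Theory(
    name="Hypothetical scale-accuracy regularity",
    description="Relates a training-scale variable X to a benchmark outcome Y.",
    law="Increasing X increases Y within the observed range.",
    scope="Only for small R; not valid if P is present.",
    evidence=[{"paper_id": "paper-001", "finding": "Y grew monotonically with X."}],
)
```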
Architecture and pipeline (high level)
The pipeline has three main stages:
Literature and discovery: starts from a user query ('make me theories about X') and retrieves up to 100 relevant papers. It uses query reformulation, PaperFinder, Semantic Scholar for open-access PDFs, and an OCR flow to extract text. If the list is short, it expands the search by following references inside papers.
Evidence extraction: Theorizer builds an extraction schema specific to the query (entities, variables, relevant outcomes). A cheap model fills that schema for each paper and produces JSON records that feed the synthesis (see the sketch after this list).
Synthesis and refinement: it aggregates the evidence and generates theories with a preconfigured prompt. Then it applies a self-reflection step to improve consistency, evidence attribution, and specificity. It also produces self-assessments of novelty and filters laws that are too close to known statements. If the evidence set exceeds the context window, the evidence is randomly subsampled.
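A minimal sketch of the extraction step, assuming a generic text-completion client; the schema fields, prompt wording, and the `llm_call` helper are assumptions for illustration, not Theorizer's actual code:

```python
import json

# Illustrative, query-specific extraction schema; the real schema is generated
# per query by the pipeline, so these fields are assumptions.
EXTRACTION_SCHEMA = {
    "variables": "entities or variables the paper studies",
    "relationship": "claimed relationship between them",
    "conditions": "experimental conditions or constraints",
    "outcome": "reported result, quantitative if available",
}

def extract_evidence(paper_text: str, llm_call) -> dict:
    """Ask an inexpensive model to fill the schema for one paper.
    `llm_call` is a placeholder for whatever completion client you use;
    it takes a prompt string and returns the model's text."""
    prompt = (
        "Fill this JSON schema using only information stated in the paper. "
        "Return valid JSON and nothing else.\n"
        f"Schema: {json.dumps(EXTRACTION_SCHEMA)}\n\nPaper:\n{paper_text}"
    )
    return json.loads(llm_call(prompt))

# One JSON record per paper; these records feed the synthesis stage.
# records = [extract_evidence(text, llm_call) for text in paper_texts]
```

Producing one structured record per paper is what makes the later aggregation and quantitative analysis straightforward.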
Models and technical components
The reference setup uses GPT-4.1 for schema generation, theory synthesis, and reflection, and GPT-5 mini for bulk evidence extraction. Other components include PaperFinder, Semantic Scholar, OCR, and a reference-backfill pipeline. Extraction output is JSON, which makes aggregation and quantitative analysis easier.
How they evaluated quality and prediction
They defined five desiderata for a good theory: specificity, empirical support, predictive ability, novelty and plausibility. Since testing thousands of theories with real experiments isn't feasible, they used two complementary approaches:
LLM-as-a-judge: models rate theory quality according to the five criteria. They compared parametric generation (only what the model already knows) vs. literature-backed generation (Theorizer's default mode).
Backtesting for predictive accuracy: they generate predictions from each law, search for later papers that could verify them, and judge whether each paper supports, contradicts, or doesn't inform the claim. From this they estimate precision and recall.
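To illustrate, here is one plausible way to turn those per-paper verdicts into precision and recall numbers, with the caveat that this counting scheme is a reading of the setup rather than the authors' exact definition:

```python
from collections import Counter

def score_law(verdicts: list[str]) -> tuple[float, float]:
    """Score one law's predictions against later papers.
    Each verdict is 'supports', 'contradicts', or 'uninformative'.
    Assumed reading of the metrics (not the authors' exact code):
      precision = supports / (supports + contradicts)
      recall    = supports / total predictions
    """
    counts = Counter(verdicts)
    supports, contradicts = counts["supports"], counts["contradicts"]
    informative = supports + contradicts
    precision = supports / informative if informative else 0.0
    recall = supports / len(verdicts) if verdicts else 0.0
    return precision, recall

# Example: 3 predictions supported, 1 contradicted, 2 with no informative paper.
print(score_law(["supports", "supports", "supports",
                 "contradicts", "uninformative", "uninformative"]))  # (0.75, 0.5)
```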
Key results:
The literature-backed version is almost 7 times more expensive than the parametric one, but produces more accurate and predictive theories.
Both approaches achieve high precision (approx. 0.88–0.90). The difference shows up in recall: literature-backed reaches ~0.51 vs. ~0.45 for parametric.
In novelty mode, the impact is larger: precision rises from 0.34 to 0.61 and recall from 0.04 to 0.16 when literature is added.
Saturation and diversity: parametric generation recycles known facts and saturates quickly. After 40 theories, the combined method maintains more diversity; 32% of statements are non-duplicated.
In backtesting, they tested 2,983 laws against 4,554 papers in 16,713 law-paper evaluations, using the first year after GPT-4.1's knowledge cutoff as the temporal window and reserving the most recent 6 months for evaluation.
Costs, limitations and technical risks
Time and cost: each query takes roughly 15–30 minutes in the experimental setup and is parallelizable, but the literature-backed version uses more resources and costs more.
Coverage: it depends on open-access papers, so it works better in fields with lots of open literature, like AI/NLP.
Biases and false positives: the literature favors positive results, which makes finding contradictory evidence harder. Theorizer can produce partially correct or misleading theories. Treat outputs as hypotheses to explore, not definitive truths.
Evidence subsampling: when evidence exceeds the context window, items are chosen randomly; that can leave out relevant studies.
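As a tiny illustration of that limitation, here is roughly what uniform random subsampling looks like when the budget is expressed as a record count (an assumption; the real budget is presumably measured in tokens):

```python
import random

def fit_to_context(evidence: list[dict], max_records: int, seed: int | None = None) -> list[dict]:
    """Uniformly subsample evidence records that would overflow the context window.
    Because sampling ignores relevance, important studies can be dropped by chance."""
    if len(evidence) <= max_records:
        return evidence
    return random.Random(seed).sample(evidence, max_records)
```

Weighting the sample by relevance or recency would reduce that risk, at the cost of departing from the setup described in the paper.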
What they publish and how it can help you
AllenAI publishes Theorizer's code on GitHub, with UI, API and all prompts. The reference pipeline used GPT-4.1 and GPT-5 mini, but you don't have to use those models. They also release a dataset of roughly 3,000 theories (2,856 in the reported run) synthesized from 13,744 papers using 100 representative queries. Each theory includes LLM summaries of the evidence per paper.
If you work in AI/NLP, this can be a starting point to quickly explore emerging patterns. If you're a researcher, Theorizer can speed up identifying gaps and generating predictions for backtesting or new experiments.
Important: Theorizer is a research tool. Its outputs are algorithmically generated hypotheses, useful to guide human work, not to replace experimental validation.
Final reflection
The central idea is simple but powerful: when the literature grows faster than any person can read, automating its synthesis into structured theories makes sense. Theorizer doesn't replace scientific judgment; it enhances it, offering a shortcut to explore, prioritize, and design experiments at scale. Are we ready to accept machine-suggested laws? Not without validating them, but we can accept them as a magnifying glass that helps us spot patterns that would otherwise stay scattered across hundreds of papers.