The number of papers being published grows at an overwhelming pace: how do you filter what's relevant when you need grounded, cited answers? SciArena aims to help answer that question by testing language models on real scientific-literature tasks, with help from the research community.
What is SciArena
SciArena is an open platform for comparing foundation models' answers to questions about scientific literature. Researchers submit queries, see side-by-side answers generated by different models, and vote for the output that best answers the question. The goal isn't to judge chatbots on style, but to measure real ability to reason over and synthesize academic work. (allenai.org)
How it works
When you submit a question, SciArena uses a retrieval pipeline adapted from the Scholar QA system to fetch relevant snippets from papers. Those contexts and the question are sent to two randomly selected models, which generate long-form answers with citations. Outputs are standardized into plain text to reduce style bias, and users then vote in a blind comparison. In parallel, an Elo-style ranking system keeps a dynamic leaderboard of model performance. (allenai.org)
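The post doesn't spell out the exact rating formula, but an Elo-style update generally looks like the minimal sketch below; the K-factor and starting ratings are illustrative assumptions, not SciArena's actual parameters.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """Update two ratings after one blind vote.

    outcome: 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (outcome - e_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: both models start at 1000; a single vote for model A nudges it up.
print(elo_update(1000.0, 1000.0, outcome=1.0))  # (1016.0, 984.0)
```

Each blind vote shifts the two ratings in opposite directions, so the leaderboard stays dynamic as new comparisons come in.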
Key results
As a data point: by late June 2025 SciArena hosted 23 frontier models, with o3 consistently leading across several disciplines. There are also differences by field: Claude-4-Opus stands out in health and DeepSeek-R1-0528 in the natural sciences. These results update as new models are added. (allenai.org)
An interesting finding for people working on automatic evaluation: SciArena-Eval, the human-preference meta-benchmark, shows that even the best automated evaluator reaches only 65.1% accuracy at predicting human preferences on scientific tasks. That makes it clear that evaluating scientific understanding remains hard for automatic systems. (allenai.org)
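That 65.1% is a pairwise agreement figure: how often an automated judge picks the same winner as the human voter. Below is a minimal sketch of that computation, with hypothetical field names rather than SciArena-Eval's actual schema.

```python
from typing import Iterable

def pairwise_accuracy(records: Iterable[dict]) -> float:
    """Fraction of comparisons where the model judge agrees with the human vote.

    Each record is assumed to hold 'human_winner' and 'judge_winner',
    both either 'A' or 'B' (hypothetical field names).
    """
    records = list(records)
    hits = sum(r["human_winner"] == r["judge_winner"] for r in records)
    return hits / len(records)

votes = [
    {"human_winner": "A", "judge_winner": "A"},
    {"human_winner": "B", "judge_winner": "A"},
    {"human_winner": "B", "judge_winner": "B"},
]
print(pairwise_accuracy(votes))  # ~0.67
```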
Data quality and control
SciArena doesn't rely on casual votes: during its first four months of internal operation it collected more than 13,000 votes from 102 experienced researchers with peer-reviewed publications. The team implemented controls such as annotator training, blind voting, and inter-annotator agreement measurements. The numbers show high self-consistency among annotators (weighted Cohen's κ of 0.91) and solid agreement between experts (weighted Cohen's κ of 0.76). Those safeguards make the data useful for training and evaluating automated evaluators. (allenai.org)
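Weighted Cohen's κ itself is straightforward to reproduce with scikit-learn; the sketch below assumes votes coded as ordinal labels and quadratic weights, which are illustrative choices rather than the paper's documented setup.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' votes on the same comparisons, coded ordinally:
# 0 = model A better, 1 = tie, 2 = model B better (illustrative coding).
annotator_1 = [0, 2, 1, 0, 2, 2, 0, 1]
annotator_2 = [0, 2, 1, 1, 2, 2, 0, 0]

kappa = cohen_kappa_score(annotator_1, annotator_2, weights="quadratic")
print(f"weighted Cohen's kappa: {kappa:.2f}")
```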
Limitations and future challenges
SciArena evaluates models using a fixed retrieval and prompting pipeline, but those components strongly influence answer quality. The team acknowledges that different indexing and prompting choices are worth exploring, and invites developers to contribute their models to keep the leaderboard current. In other words, the results are valuable, but they depend on system design choices. (allenai.org)
Why should you care?
If you work with papers, do literature reviews, or build tools for researchers, SciArena offers three practical things:
- A hands-on way to compare models on real tasks, not just synthetic benchmarks.
- A dataset of human preferences and public code so you can reproduce or improve the evaluation. You can download SciArena-Eval from its repo and the data from Hugging Face (see the loading sketch after this list). (allenai.org)
- Transparency in methodology that helps you understand why one model scores higher than another and where automatic evaluations fail.
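If you want to poke at the preference data yourself, the Hugging Face datasets library is the usual route; the dataset identifier below is a placeholder to replace with the id listed on the project page.

```python
from datasets import load_dataset

# "ORG/SciArena" is a placeholder: use the dataset id from the project page.
# The split name is also an assumption.
ds = load_dataset("ORG/SciArena", split="train")

print(ds)     # dataset summary: columns and number of rows
print(ds[0])  # one human-preference record
```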
Where to try it and read more
You can compare models and vote directly on the SciArena platform. If you want to dive into the methodology and results, the team published a paper and made the code and data public on GitHub and Hugging Face. (allenai.org)
Thinking about SciArena is thinking about evaluation as a community process: if AI helps you read science, isn't it better that this help is measured by real experts and by tasks that mimic everyday research work? Doesn't that seem like the most sensible way to move forward?