On August 18, 2025, the Allen Institute for AI (Ai2) published MoNaCo, a new benchmark designed to evaluate how language models handle real questions that require reasoning over dozens or hundreds of documents. Why does this matter now that everyone is talking about LLMs and information retrieval? Because MoNaCo focuses on what remains hard for AI. (allenai.org)
What is MoNaCo
MoNaCo (More Natural and Complex questions) brings together 1,315 questions written by people simulating real searches, each requiring many intermediate steps to solve. The answers aren't hidden on a single page: the solutions involve combining information from dozens, and sometimes hundreds, of Wikipedia pages. (allenai.org)
In addition, each question comes with a human-annotated chain of reasoning (gold-standard reasoning chains). That means the authors provide not only the final answer, but also the intermediate steps and the evidence (sentences and tables) that support each one. This makes MoNaCo useful both for evaluating models and for training systems that need to justify their answers. (allenai.org)
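To give a feel for what these annotations look like, here is an illustrative record sketched in Python. The field names and values are assumptions made for this example, not the official MoNaCo schema; check the paper and dataset card for the real structure.

```python
# Illustrative shape of a MoNaCo-style question with its gold reasoning chain.
# Field names and values are assumptions for this example, not the official schema.
example = {
    "question": "Which of the two films won more Academy Awards?",
    "answer": "Film A",
    "reasoning_chain": [
        {
            "subquestion": "How many Academy Awards did Film A win?",
            "answer": "4",
            "evidence": ["sentence from the 'Film A' Wikipedia article",
                         "row from its awards table"],
        },
        {
            "subquestion": "How many Academy Awards did Film B win?",
            "answer": "2",
            "evidence": ["sentence from the 'Film B' Wikipedia article"],
        },
        {
            "subquestion": "Which count is larger?",
            "answer": "Film A",
            "evidence": [],  # a pure comparison step over the earlier answers
        },
    ],
}
```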
What they measured and what they found
Ai2 tested 15 cutting-edge models on MoNaCo (including GPT-5, o3, Claude Opus 4, Gemini 2.5 Pro and DeepSeek-R1), and the results made it clear that models still struggle with this type of question. The best model in their experiment, o3, reached an F1 of 61.2% and answered only 38.7% of the examples perfectly. In plain terms: even the strongest models fail frequently when the task demands many checks and synthesis. (allenai.org)
When evaluators gave the models all the correct evidence (the oracle retrieval scenario), performance rose by about 10 points, but models still only reached 58.7% F1. And in end-to-end RAG scenarios using a real BM25 retriever, performance dropped drastically due to retrieval and robustness issues. In other words, the problem isn't just reasoning well: it's also finding the right evidence. (allenai.org)
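To make the oracle-versus-retrieval comparison concrete, here is a minimal sketch of that kind of experiment in Python. It assumes a `generate_answer(question, evidence)` function wrapping whatever model you test, plus illustrative dataset fields (`question`, `answer`, `gold_evidence`); the retriever uses the rank_bm25 package. This is a sketch of the setup, not Ai2's evaluation code.

```python
# Compare answering with gold (oracle) evidence vs. BM25-retrieved passages.
# `generate_answer` and the dataset fields are hypothetical stand-ins.
from collections import Counter
from rank_bm25 import BM25Okapi

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1 between a predicted and a gold answer."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate(example, corpus, generate_answer, k=20, oracle=False):
    """Answer one question with oracle evidence or with top-k BM25 passages."""
    if oracle:
        evidence = example["gold_evidence"]          # all gold supporting snippets
    else:
        bm25 = BM25Okapi([doc.split() for doc in corpus])
        evidence = bm25.get_top_n(example["question"].split(), corpus, n=k)
    prediction = generate_answer(example["question"], evidence)
    return token_f1(prediction, example["answer"])
```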
Data that shows why this is hard
In MoNaCo, each question relies on many pages: on average 43.3 documents per question (median 12). The evidence is mixed: sentences, tables and lists, with tables in particular making up a large part of the supporting material. There are also 40K boolean questions and more than 90K annotated intermediate subquestions. All of this makes the dataset broad and demanding, testing a model's ability to decompose tasks and combine heterogeneous facts. (allenai.org)
A concrete example: a question about whether left-wing parties in European countries are more often led by women than right-wing parties requires reviewing hundreds of pages (the cited example involves the equivalent of 719 Wikipedia pages). Can you imagine doing that by hand? For an LLM it's also an enormous challenge. (allenai.org)
What this means for products and developers
- For search-style products or assistants that use RAG: having a good model isn't enough; you need a robust retriever and ways to filter partial evidence (see the sketch after this list).
- For researchers: MoNaCo's human reasoning chains are valuable for training and auditing models that must justify answers.
- For users and companies: LLM answers may feel fast, but for complex tasks you still need to see the evidence and verify intermediate steps before you fully trust the response. (allenai.org)
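As a sketch of the "show your evidence" pattern those points suggest, the snippet below decomposes a question, answers each step against retrieved passages, and returns the intermediate steps alongside the final answer so they can be verified. `llm` and `retrieve` are hypothetical stand-ins for your own model and retriever; nothing here is MoNaCo-specific.

```python
# Hypothetical "answer with verifiable steps" loop; `llm` and `retrieve` are
# placeholders for whatever model and retriever your product already uses.
def answer_with_evidence(question: str, llm, retrieve, k: int = 10) -> dict:
    # 1) Ask the model to break the question into simpler subquestions.
    raw = llm(f"Decompose into numbered subquestions:\n{question}")
    subquestions = [line.strip() for line in raw.splitlines() if line.strip()]

    # 2) Answer each subquestion against its own retrieved evidence.
    steps = []
    for sub in subquestions:
        passages = retrieve(sub, k=k)  # robust retrieval matters here
        answer = llm(f"Answer using only this evidence:\n{passages}\n\nQuestion: {sub}")
        steps.append({"subquestion": sub, "evidence": passages, "answer": answer})

    # 3) Synthesize a final answer and return the chain so users can inspect it.
    final = llm(f"Combine these step answers into one answer:\n{steps}\n\nQuestion: {question}")
    return {"answer": final, "steps": steps}
```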
How to access it and move forward
MoNaCo is publicly available: Ai2 shares the benchmark along with the project page, the associated paper and resources on HuggingFace. It's an open invitation to the community to evaluate models, improve retrievers and build more attributable, factual systems. (allenai.org)
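If you want to explore the data, a typical starting point with the `datasets` library looks like the snippet below. The dataset identifier is a placeholder made up for illustration; check Ai2's project page or HuggingFace organization for the exact name and field layout.

```python
# Sketch of loading the benchmark with Hugging Face's `datasets` library.
# "allenai/monaco" is a placeholder id; check the project page for the real one.
from datasets import load_dataset

monaco = load_dataset("allenai/monaco")
print(monaco)                         # available splits and their sizes
first_split = next(iter(monaco.values()))
print(first_split[0].keys())          # inspect the actual field names before relying on them
```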
MoNaCo reminds us of something important: AI is already good at shortcuts, but long questions that require many sources remain active research territory.
Are you interested in trying it out, or do you want me to explain how to use MoNaCo to evaluate a model or a retrieval system? I can guide you step by step.