On August 18, 2025, the Allen Institute for AI (Ai2) published MoNaCo, a new benchmark designed to evaluate how language models handle real questions that require reasoning over dozens or hundreds of documents. Why does this matter now that everyone is talking about LLMs and information retrieval? Because MoNaCo focuses on what remains hard for AI. (allenai.org)
What is MoNaCo
MoNaCo (More Natural and Complex questions) brings together 1,315 questions written by people simulating real searches, each requiring many intermediate steps to solve. The answers aren't hidden on a single page: the solutions involve combining information from dozens, and sometimes hundreds, of Wikipedia pages. (allenai.org)
In addition, each question comes with a human-annotated chain of reasoning (gold-standard reasoning chains). That means the authors provide not only the final answer, but also the intermediate steps and the evidence (sentences and tables) that support each one. This makes MoNaCo useful both for evaluating models and for training systems that need to justify their answers. (allenai.org)
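To give a feel for what these annotations look like, here is an illustrative record sketched in Python. The field names and values are assumptions made for this example, not the official MoNaCo schema; check the paper and dataset card for the real structure.

```python
# Illustrative shape of a MoNaCo-style question with its gold reasoning chain.
# Field names and values are assumptions for this example, not the official schema.
example = {
    "question": "Which of the two films won more Academy Awards?",
    "answer": "Film A",
    "reasoning_chain": [
        {
            "subquestion": "How many Academy Awards did Film A win?",
            "answer": "4",
            "evidence": ["sentence from the 'Film A' Wikipedia article",
                         "row from its awards table"],
        },
        {
            "subquestion": "How many Academy Awards did Film B win?",
            "answer": "2",
            "evidence": ["sentence from the 'Film B' Wikipedia article"],
        },
        {
            "subquestion": "Which count is larger?",
            "answer": "Film A",
            "evidence": [],  # a pure comparison step over the earlier answers
        },
    ],
}
```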
What they measured and what they found
Ai2 tested 15 cutting-edge models on MoNaCo (including GPT-5, o3, Claude Opus 4, Gemini 2.5 Pro and DeepSeek-R1), and the results made it clear that models still struggle with this type of question. The best model in their experiment, o3, reached an F1 of 61.2% and answered only 38.7% of the examples perfectly. In plain terms: even the strongest models fail frequently when the task demands many checks and synthesis. (allenai.org)
When evaluators gave the models all the correct evidence (the oracle retrieval scenario), performance rose by about 10 points, but models still only reached 58.7% F1. And in end-to-end RAG scenarios using a real BM25 retriever, performance dropped drastically due to retrieval and robustness issues. In other words, the problem isn't just reasoning well: it's also finding the right evidence. (allenai.org)
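To make the oracle-versus-retrieval comparison concrete, here is a minimal sketch of that kind of experiment in Python. It assumes a `generate_answer(question, evidence)` function wrapping whatever model you test, plus illustrative dataset fields (`question`, `answer`, `gold_evidence`); the retriever uses the rank_bm25 package. This is a sketch of the setup, not Ai2's evaluation code.

```python
# Compare answering with gold (oracle) evidence vs. BM25-retrieved passages.
# `generate_answer` and the dataset fields are hypothetical stand-ins.
from collections import Counter
from rank_bm25 import BM25Okapi

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1 between a predicted and a gold answer."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate(example, corpus, generate_answer, k=20, oracle=False):
    """Answer one question with oracle evidence or with top-k BM25 passages."""
    if oracle:
        evidence = example["gold_evidence"]          # all gold supporting snippets
    else:
        bm25 = BM25Okapi([doc.split() for doc in corpus])
        evidence = bm25.get_top_n(example["question"].split(), corpus, n=k)
    prediction = generate_answer(example["question"], evidence)
    return token_f1(prediction, example["answer"])
```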
Data that shows why this is hard
In MoNaCo, each question relies on many pages: on average 43.3 documents per question (median 12). The evidence is mixed: sentences, tables and lists, with tables in particular making up a large part of the supporting material. There are also 40K boolean questions and more than 90K annotated intermediate subquestions. All of this makes the dataset broad and demanding, testing a model's ability to decompose tasks and combine heterogeneous facts. (allenai.org)
A concrete example: a question about whether left-wing parties in European countries are more often led by women than right-wing parties requires reviewing hundreds of pages (the cited example involves the equivalent of 719 Wikipedia pages). Can you imagine doing that by hand? For an LLM it's also an enormous challenge. (allenai.org)
What this means for products and developers
- For search-style products or assistants that use RAG: having a good model isn't enough; you need a robust retriever and ways to filter partial evidence (see the sketch after this list).
- For researchers: MoNaCo's human reasoning chains are valuable for training and auditing models that must justify answers.
- For users and companies: LLM answers may feel fast, but for complex tasks you still need to see the evidence and verify intermediate steps before you fully trust the response. (allenai.org)
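As a sketch of the "show your evidence" pattern those points suggest, the snippet below decomposes a question, answers each step against retrieved passages, and returns the intermediate steps alongside the final answer so they can be verified. `llm` and `retrieve` are hypothetical stand-ins for your own model and retriever; nothing here is MoNaCo-specific.

```python
# Hypothetical "answer with verifiable steps" loop; `llm` and `retrieve` are
# placeholders for whatever model and retriever your product already uses.
def answer_with_evidence(question: str, llm, retrieve, k: int = 10) -> dict:
    # 1) Ask the model to break the question into simpler subquestions.
    raw = llm(f"Decompose into numbered subquestions:\n{question}")
    subquestions = [line.strip() for line in raw.splitlines() if line.strip()]

    # 2) Answer each subquestion against its own retrieved evidence.
    steps = []
    for sub in subquestions:
        passages = retrieve(sub, k=k)  # robust retrieval matters here
        answer = llm(f"Answer using only this evidence:\n{passages}\n\nQuestion: {sub}")
        steps.append({"subquestion": sub, "evidence": passages, "answer": answer})

    # 3) Synthesize a final answer and return the chain so users can inspect it.
    final = llm(f"Combine these step answers into one answer:\n{steps}\n\nQuestion: {question}")
    return {"answer": final, "steps": steps}
```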
How to access it and move forward
MoNaCo is publicly available: Ai2 shares the benchmark along with the project page, the associated paper and resources on HuggingFace. It's an open invitation to the community to evaluate models, improve retrievers and build more attributable, factual systems. (allenai.org)
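If you want to explore the data, a typical starting point with the `datasets` library looks like the snippet below. The dataset identifier is a placeholder made up for illustration; check Ai2's project page or HuggingFace organization for the exact name and field layout.

```python
# Sketch of loading the benchmark with Hugging Face's `datasets` library.
# "allenai/monaco" is a placeholder id; check the project page for the real one.
from datasets import load_dataset

monaco = load_dataset("allenai/monaco")
print(monaco)                         # available splits and their sizes
first_split = next(iter(monaco.values()))
print(first_split[0].keys())          # inspect the actual field names before relying on them
```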
MoNaCo reminds us of something important: AI is already good at shortcuts, but long questions that require many sources remain active research territory.
Are you interested in trying it out, or do you want me to explain how to use MoNaCo to evaluate a model or a retrieval system? I can guide you step by step.