In practice, AI agents no longer just produce text answers; they are systems that need to act, use tools and adapt when things go wrong. But how do you test that without resorting to environments that are either too rigid or too unrealistic? Hugging Face, together with Meta, introduces Gaia2 and ARE to answer exactly that problem.
What Gaia2 is and why it matters
Gaia2 is the evolution of the GAIA benchmark: it moves from a read-only question set to a read-and-write challenge where agents interact with simulated apps, handle ambiguity, meet deadlines and respond to failures in real time. The goal isn't just to measure whether a model can search, but how it plans, executes and adapts under noisy, changing conditions. (huggingface.co)
Gaia2 includes hundreds of human-crafted scenarios grouped into concrete capabilities: execution, search, ambiguity handling, adaptability, temporal reasoning, agent-to-agent collaboration and noise tolerance. That lets you test things closer to what a real assistant would have to do. (huggingface.co)
ARE: the environment that makes the simulation believable
ARE is the execution environment that accompanies the benchmark. Imagine a simulated mobile phone with apps like email, calendar, contacts and a messaging system: everything is prepopulated and available for tool calls by the agent. ARE records structured traces (tool calls, responses, the model’s “thoughts”, timings, interactions) that you can export as JSON for debugging and analysis. (huggingface.co)
Why is that useful? Because you no longer depend only on a final metric: you can see step by step where the agent got lost, whether it took a thousand tokens to reach the solution or if it failed when the simulated API returned an error. That turns ARE into a toolbox for building more robust assistants. (huggingface.co)
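To give a feel for what analyzing those exported traces can look like, here is a minimal Python sketch; the file path and field names (steps, type, tool, arguments, output) are assumptions for illustration, not the actual ARE schema.

```python
import json
from pathlib import Path

# Load one exported trace (the path and the field names below are assumed
# for illustration; map them to the JSON that ARE actually exports).
trace = json.loads(Path("traces/scenario_0042.json").read_text())

# Walk the recorded steps and print each tool call the agent made, so you
# can spot where it looped, stalled, or hit a simulated API error.
for i, step in enumerate(trace.get("steps", [])):
    if step.get("type") == "tool_call":
        print(f"{i:03d} {step.get('tool')} args={step.get('arguments')}")
        output = step.get("output") or {}
        if isinstance(output, dict) and output.get("error"):
            print(f"    ^ tool returned an error: {output['error']}")
```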
Gaia2 is intended for dynamic, temporal scenarios; the dataset is released under CC BY 4.0 and ARE under the MIT license, making it easy for the community to use and extend. (huggingface.co)
Key results: which models and what remains
The published tests compared open and closed models (Llama 3.3-70B, Llama-4-Maverick, GPT-4o, Qwen3-235B-MoE, Grok-4, Kimi K2, Gemini 2.5 Pro, Claude 4 Sonnet and GPT-5). The overall leader at the time of the article was GPT-5 in high-reasoning mode; among open models, Kimi K2 performed best. The toughest areas, however, remained ambiguity, adaptability and, above all, temporal tasks. (huggingface.co)
One interesting point: Gaia2 doesn’t just report raw scores. It also normalizes by cost and time (LLM calls and output tokens), because an agent that reaches the answer after a thousand calls isn’t as useful as one that does it quickly and cheaply. That keeps the focus on practical performance, not just accuracy. (huggingface.co)
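To make the idea concrete, here is a toy Python sketch of cost-aware scoring; the budgets and formula below are invented for illustration and are not Gaia2's published normalization.

```python
# Toy illustration of cost-aware scoring: penalize runs that need many LLM
# calls or many output tokens. The budgets and formula are invented for this
# example; Gaia2's actual normalization may differ.
def cost_adjusted_score(accuracy: float, llm_calls: int, output_tokens: int,
                        call_budget: int = 20, token_budget: int = 20_000) -> float:
    overuse = max(llm_calls / call_budget, output_tokens / token_budget, 1.0)
    return accuracy / overuse

# Same accuracy, very different efficiency:
print(cost_adjusted_score(0.80, llm_calls=12, output_tokens=8_000))       # 0.80
print(cost_adjusted_score(0.80, llm_calls=1_000, output_tokens=400_000))  # 0.016
```

The point is simply that two agents with identical accuracy can end up far apart once call and token budgets enter the score.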
Want to try Gaia2 with your model? Quick steps
- Install the recommended framework (Meta ARE):
  pip install meta-agents-research-environments
- Run the benchmark with are-benchmark, configuring the split and your model. In the blog they test all configurations (execution, search, adaptability, time, ambiguity) and upload the results to the Hub for centralized logging. A short example of the command they show:
are-benchmark run --hf meta-agents-research-environments/Gaia2 \
--split validation --config CONFIGURATION \
--model YOUR_MODEL --model_provider YOUR_PROVIDER \
--agent default --max_concurrent_scenarios 2 \
--scenario_timeout 300 --output_dir ./monitored_test_results \
--hf_upload YOUR_DATASET_ON_HF
- Judge and add your result to the leaderboard with the judge command, and share traces for collaborative analysis. You can also use the web demo to try it without installing anything. (huggingface.co)
If you want the dataset or the demo: the Gaia2 dataset is available on Hugging Face and the ARE code is on GitHub, with instructions to get started. (huggingface.co)
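If you just want to browse the scenarios before wiring up an agent, a minimal sketch with the datasets library can help; the config name used here is an assumption based on the capability splits mentioned above, so check the dataset card for the exact names.

```python
from datasets import load_dataset

# Load one Gaia2 configuration from the Hub. "execution" is used here as an
# example config name; check the dataset card for the exact list of configs.
gaia2 = load_dataset("meta-agents-research-environments/Gaia2",
                     name="execution", split="validation")

print(gaia2)            # number of scenarios and available columns
print(gaia2[0].keys())  # fields of a single scenario
```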
Practical ideas and precautions
- For product teams: use Gaia2/ARE to validate critical flows (scheduling appointments, updating contacts, time-sensitive actions). You’ll see whether the agent understands a 3-minute deadline or breaks when a service fails.
- For researchers: the JSON traces are gold for studying chains of reasoning and orchestration failures, and for generating fine-tuning data targeted at tool calling; see the sketch after this list.
- Security caution: by default, agents are JSON agents and cannot modify your machine, but if you connect MCPs or external tools with real permissions, do so carefully. Never grant unnecessary permissions to an agent under test. (huggingface.co)
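Building on the researchers’ bullet above, here is a rough sketch of turning recorded tool calls into supervised tool-calling examples; the trace schema and message format are assumptions to adapt to the JSON that ARE actually exports.

```python
import json
from pathlib import Path

def trace_to_examples(trace_path: Path) -> list[dict]:
    """Turn one exported ARE trace into chat-style tool-calling examples.
    The field names used here are hypothetical; map them to the real schema."""
    trace = json.loads(trace_path.read_text())
    examples = []
    for step in trace.get("steps", []):
        if step.get("type") != "tool_call":
            continue
        examples.append({
            "messages": [
                {"role": "user", "content": step.get("context", "")},
                {"role": "assistant", "content": json.dumps({
                    "tool": step.get("tool"),
                    "arguments": step.get("arguments", {}),
                })},
            ]
        })
    return examples

# Collect training examples from every trace in a results directory.
dataset = [ex for path in Path("monitored_test_results").glob("*.json")
           for ex in trace_to_examples(path)]
print(f"{len(dataset)} tool-calling examples")
```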
To close (quick reflection)
Gaia2 and ARE don’t promise perfect assistants overnight. What they provide is infrastructure to test, understand and improve agents in less idealized conditions. If you work with conversational assistants or tools that perform actions, this shifts the question: it’s no longer just "how much do they know" but "how do they act under pressure". That’s exactly what Gaia2 measures. (huggingface.co)