Today IBM Research and the community are launching the Open Agent Leaderboard, an open evaluation that measures AI agents as complete systems, not just isolated models.
Why does that change things? Because when you deploy an agent you don't just pick a model: you pick planning, memory, error handling, and which tools it can use. Change any of those elements and the same model can behave very differently—and at very different costs.
What it is and why it matters
The Open Agent Leaderboard compares complete agent systems across six different benchmarks and reports both quality and cost. That lets you see not only what works, but what is worth deploying in production.
The core idea is to measure generality: how well an agent holds up when it faces new tasks and rules without task-specific tuning. Generality is treated as a spectrum, not a binary label. What matters is that an agent stays capable as the variety of tasks grows, and that it does so at a reasonable cost.
The evaluation architecture: Exgentic and the unified protocol
The evaluation uses Exgentic, an open framework for running and reproducing tests. To make very different benchmarks coexist, the authors introduce a protocol that gives every task the same shape: a task (what to do), a context (what is known), and a set of actions (what's allowed).
That standardization means agents don't have to speak every benchmark's language: they all speak one. Fitting each benchmark into the protocol required adapting some of its assumptions and interfaces, so results can differ from those on each benchmark's own leaderboard.
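The three-part shape described above can be pictured with a small data structure. This is only an illustrative sketch: the class names `UnifiedTask` and `Action`, their fields, and the sample tasks are assumptions made for this example, not Exgentic's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Action:
    """One thing the agent is allowed to do (hypothetical shape)."""
    name: str
    description: str
    handler: Callable[..., Any]


@dataclass
class UnifiedTask:
    """A benchmark item reduced to task + context + actions (hypothetical)."""
    task: str                                            # what to do
    context: Dict[str, Any]                              # what is known
    actions: List[Action] = field(default_factory=list)  # what's allowed

    def action_names(self) -> List[str]:
        return [action.name for action in self.actions]

    def call(self, name: str, **kwargs: Any) -> Any:
        """Dispatch an action by name and reject anything not allowed."""
        for action in self.actions:
            if action.name == name:
                return action.handler(**kwargs)
        raise PermissionError(f"action {name!r} is not allowed for this task")


# A customer-service item and a coding item end up with the same shape.
# All values below are made up for illustration.
refund = UnifiedTask(
    task="Process the customer's refund request according to policy.",
    context={"order_id": "A-100", "policy": "Refunds allowed within 30 days."},
    actions=[Action("lookup_order", "Fetch an order by id",
                    lambda order_id: {"order_id": order_id, "age_days": 12})],
)
bugfix = UnifiedTask(
    task="Make the failing test pass.",
    context={"repo": "example/repo", "failing_test": "test_parse"},
    actions=[Action("read_file", "Read a file from the repo",
                    lambda path: f"<contents of {path}>")],
)
```

Under this assumed shape, an agent only needs to read `task`, inspect `context`, and choose from `action_names()`, whichever benchmark the item came from.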
The six benchmarks used
SWE-Bench Verified: fix real bugs in real repositories.
BrowseComp+: research complex questions on the web.
AppWorld: complete personal tasks across hundreds of apps and actions.
tau2-Bench Airline and tau2-Bench Retail (counted as two benchmarks): customer service following company policies.
tau2-Bench Telecom: technical support following company policies.
Each benchmark was chosen because it brings a different dimension: real changes to code, open research, broad action spaces, conversations with rules. That mix is what gives the generality evaluation meaning.
What the leaderboard measures and how to read it
Each row of the leaderboard is a complete system: a concrete agent wrapped around a concrete model, evaluated across the six benchmarks. For each configuration it shows:
Average success rate.
Average cost per task.
Breakdown by benchmark.
So you can plot quality versus cost and see trade-offs: configurations with the same quality can differ by orders of magnitude in price.
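One way to read such a quality-versus-cost plot is to keep only the configurations that no other configuration beats on both axes (the Pareto frontier). The sketch below assumes a simplified row with one average success rate and one average cost; the `Row` type and all numbers are invented for illustration and are not leaderboard data.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    system: str           # agent + model label (made up here)
    success_rate: float   # average success rate, between 0 and 1
    cost_per_task: float  # average cost per task, in dollars


def pareto_frontier(rows: List[Row]) -> List[Row]:
    """Keep rows that no other row beats on both quality and cost."""
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; among equal costs, best quality first.
    for row in sorted(rows, key=lambda r: (r.cost_per_task, -r.success_rate)):
        # A pricier row earns its place only if it is strictly better.
        if row.success_rate > best_quality:
            frontier.append(row)
            best_quality = row.success_rate
    return frontier


# Illustrative numbers only.
rows = [
    Row("agent-a/model-x", 0.60, 4.00),
    Row("agent-b/model-x", 0.60, 0.40),
    Row("agent-c/model-y", 0.45, 0.10),
    Row("agent-d/model-y", 0.40, 0.90),
]

for row in pareto_frontier(rows):
    print(f"{row.system}: {row.success_rate:.0%} at ${row.cost_per_task:.2f}/task")
```

In this made-up data, `agent-a` and `agent-b` reach the same quality with the same model, but `agent-a` costs ten times more per task, so only `agent-b` stays on the frontier.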
Main findings (technical and practical)
The model still explains most of the variance in performance, but the agent architecture already has a visible impact. In other words: the agent matters.
Same model, different agents: different results and different costs. Example: among the top five, the first three use the same model but differ both in score and in cost because of their agent implementation.
How an agent handles failures matters as much as its success rate. In the experiments, failed runs cost between 20% and 54% more than successful ones. What does that mean for production? Making failures cheaper can lower your bill even when the success rate stays the same.
The technique of tool shortlisting (narrowing which tools to consider before searching) improved performance across all models and turned failing configurations into viable ones. It's a practical, replicable lever.
General agents without benchmark-specific tuning competed with specialized systems in several cases. In other words: generality is already paying off.
About open-weight models: the leaderboard adds DeepSeek V3.2 and Kimi K2.5. The results show these models are competitive in specific combinations, but on average they trail frontier closed models by 18 to 29 percentage points.
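To make the tool-shortlisting finding above more concrete, here is a minimal sketch of the idea. The source does not say how the evaluated agents shortlist tools, so the keyword-overlap scoring, the `shortlist_tools` function, the stopword list, and the tool catalog below are all assumptions chosen for simplicity.

```python
import re
from typing import Dict, List, Set

# Tiny stopword list so filler words do not count as matches (assumption).
STOPWORDS = {"a", "an", "the", "to", "of", "my", "by", "from"}


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens, minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def shortlist_tools(task: str, tools: Dict[str, str], k: int = 3) -> List[str]:
    """Return up to k tool names whose name and description best match the task."""
    task_tokens = tokenize(task)
    scored = []
    for name, description in tools.items():
        overlap = len(task_tokens & tokenize(name + " " + description))
        if overlap > 0:  # tools with no overlap are never shortlisted
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical tool catalog; a real action space may hold hundreds of entries.
tools = {
    "music_play_song": "Play a song by title",
    "pay_send_payment": "Send a payment to a friend",
    "pay_request_payment": "Request a payment from a friend",
    "mail_send_email": "Send an email message",
    "calendar_create_event": "Create a calendar event",
}

task = "Send a payment of 20 dollars to my friend Alex"
print(shortlist_tools(task, tools, k=2))
```

With this catalog the agent would see only the two payment tools instead of all five. A production system would more likely rank tools with embeddings or a model call, but the principle of shrinking the action space before the agent plans is the same.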
Implications for teams developing or deploying agents
If you're putting an agent into production, don't look only at success rate: look at cost per task and the failure pattern. Sometimes a cheaper, less brilliant agent is the better choice.
Document and version agent components (planning, memory, tool usage, context management, error recovery). That makes reproducibility easier and helps you diagnose which part drives improvements.
Integrate tool shortlisting and strategies to contain long, costly execution paths. Small agent-architecture changes can unlock big efficiency gains.
Use Exgentic to reproduce tests and expose your agent to untuned settings. If your agent survives without tuning, you have real evidence of generality.
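The advice to look at cost per task together with the failure pattern can be turned into simple arithmetic. The sketch below assumes every failed run costs a fixed percentage more than a successful one; the 20% and 54% overheads come from the findings above, while the 50% success rate and the $1.00 cost per successful run are invented inputs.

```python
def expected_cost_per_task(success_rate: float, success_cost: float,
                           failure_overhead: float) -> float:
    """Average cost of one attempted task, mixing successes and failures."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    # A failed run costs (1 + overhead) times a successful one.
    failure_cost = success_cost * (1.0 + failure_overhead)
    return success_rate * success_cost + (1.0 - success_rate) * failure_cost


def cost_per_solved_task(success_rate: float, success_cost: float,
                         failure_overhead: float) -> float:
    """Total spend divided by the number of tasks actually solved."""
    attempted = expected_cost_per_task(success_rate, success_cost, failure_overhead)
    return attempted / success_rate


# Illustrative inputs only: 50% success rate, $1.00 per successful run.
for overhead in (0.20, 0.54):
    attempted = expected_cost_per_task(0.5, 1.00, overhead)
    solved = cost_per_solved_task(0.5, 1.00, overhead)
    print(f"overhead {overhead:.0%}: ${attempted:.2f} per attempt, "
          f"${solved:.2f} per solved task")
```

With these made-up inputs, the cost per solved task is about $2.20 at a 20% failure overhead and about $2.54 at 54%, even though the success rate is identical in both cases.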
Limitations and next steps for research
Not everything is covered: the benchmarks don't capture every capability a general agent will need, and some suites required adaptation because they weren't designed for general agents. The project is still evolving and depends on the community to contribute more agents, benchmarks, and models.
The leaderboard is an open platform: you can submit results (through a pull request to the dataset), integrate new benchmarks with a programmatic evaluator, or add open-weight models.
Technical reading and reproducibility
The full methodology and empirical analysis are in the associated paper. Exgentic provides standardized sessions, traces, and cost reports that let you reproduce experiments and break down what drives each result: model, agent design, or specific components.
If you want to experiment: try the published configurations, reproduce the traces, and analyze which components change how an agent fails or succeeds.
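When comparing runs, it helps to know exactly which component changed between two agent configurations. The sketch below is one possible way to track that, not part of Exgentic: the `AgentConfig` fields mirror the components named in this article, and the values are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import List


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical record of the parts that make up one agent system."""
    model: str
    planning: str
    memory: str
    tool_usage: str
    context_management: str
    error_recovery: str

    def fingerprint(self) -> str:
        """Stable short hash, so every result can be tied to one exact config."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def changed_components(old: AgentConfig, new: AgentConfig) -> List[str]:
    """Names of the components that differ between two configurations."""
    old_fields, new_fields = asdict(old), asdict(new)
    return sorted(key for key in old_fields if old_fields[key] != new_fields[key])


# Placeholder values for illustration.
base = AgentConfig(
    model="model-x",
    planning="react-loop",
    memory="none",
    tool_usage="all-tools",
    context_management="truncate-oldest",
    error_recovery="retry-once",
)
variant = replace(base, tool_usage="shortlist-top-5")

print(base.fingerprint(), variant.fingerprint())
print(changed_components(base, variant))
```

Storing the fingerprint next to each trace and cost report makes it easier to attribute a change in score or cost to a single component, here `tool_usage`.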
General agents aren't a distant promise. They're here, and measuring them correctly requires looking at the whole system, the cost, and how they fail. The Open Agent Leaderboard is a step toward more useful, open, and comparable evaluations. If you work with agents, this leaderboard gives you tools to evaluate and improve not just the model but everything around it.