Open Agent Leaderboard: evaluate AI agents by quality and cost

Today IBM Research and the community are launching the Open Agent Leaderboard, an open evaluation that measures AI agents as complete systems, not just isolated models.

Why does that matter? Because deploying an agent means choosing more than a model: you also choose how it plans, what it remembers, how it handles errors, and which tools it can use. Change any of those elements and the same model can behave very differently, and at a very different cost.

What it is and why it matters

The Open Agent Leaderboard compares complete agent systems across six benchmarks and reports both quality and cost. That lets you see not only which agents work, but which are worth deploying in production.
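One way to read a quality-and-cost table is to drop every agent that another agent beats on both axes. The sketch below is only an illustration of that reading, not the leaderboard's own method: the `AgentResult` type, the agent and benchmark names, the choice of a plain mean as the quality score, and all numbers are assumptions made up for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class AgentResult:
    """Hypothetical record for one agent system; not the leaderboard's schema."""
    name: str
    scores: dict[str, float]  # benchmark name -> success rate in [0, 1]
    cost_usd: float           # assumed average cost per task, in dollars


def mean_quality(result: AgentResult) -> float:
    """Average success rate across the benchmarks the agent was run on."""
    return mean(result.scores.values())


def undominated(results: list[AgentResult]) -> list[AgentResult]:
    """Keep agents that no other agent matches or beats on both quality and
    cost while being strictly better on at least one of the two."""
    kept = []
    for a in results:
        qa = mean_quality(a)
        dominated = any(
            mean_quality(b) >= qa
            and b.cost_usd <= a.cost_usd
            and (mean_quality(b) > qa or b.cost_usd < a.cost_usd)
            for b in results
        )
        if not dominated:
            kept.append(a)
    return kept


# Made-up numbers for illustration only; these are not leaderboard results.
results = [
    AgentResult("agent_a", {"bench_1": 0.70, "bench_2": 0.50}, cost_usd=0.40),
    AgentResult("agent_b", {"bench_1": 0.60, "bench_2": 0.50}, cost_usd=0.90),
    AgentResult("agent_c", {"bench_1": 0.80, "bench_2": 0.70}, cost_usd=1.50),
]

for r in undominated(results):
    print(f"{r.name}: quality={mean_quality(r):.2f}, cost=${r.cost_usd:.2f}")
```

With these invented numbers, `agent_b` drops out because `agent_a` is both cheaper and better on average, while `agent_a` and `agent_c` both remain as different points on the quality and cost trade-off.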

The core idea is to measure generality: how well an agent holds up when it faces new tasks and rules without task-specific tuning. Generality is treated as a spectrum, not a binary label. What matters is that an agent stays capable as the variety of tasks grows, and that it does so at a reasonable cost.

