VAKRA shows the limits of agents in reasoning and API use
VAKRA is an executable benchmark that tests whether AI agents can not only call tools but also compose steps, handle data, and respect policies in enterprise-like environments. Do you really have an agent that reasons, or one that just guesses the next API to call? Here I explain how VAKRA is built, what it measures, and why its results matter when deploying agents to production.
What VAKRA is and why it matters
VAKRA (tool-grounded, executable benchmark) evaluates agents’ ability to complete multi-step workflows in realistic settings: more than 8,000 local APIs, 62 domains, and document collections aligned by domain. Unlike isolated benchmarks, VAKRA demands composition: combining calls to structured APIs with retrieval of unstructured documents, while respecting usage constraints when they apply.
What's the goal? To measure not just the final answer, but the full execution trajectory: tool calls, arguments, intermediate outputs, and the final response. That executable evaluation is what uncovers subtle failures that classic tests miss.
Architecture and tool collections
Executable environment: an MCP server exposes local APIs and avoids heavy data transfers by keeping datasets server-side.
SEL-BIRD: larger collections with specific functions (e.g. sort_data_ascending and sort_data_descending), and key getters like get_team_name.
REST-BIRD: endpoint-style APIs served by FastAPI; each domain has between 6 and 328 tools (avg ~116).
Key primitive: the get_data(tool_universe_id=...) function, which must be called when starting an instance. It returns a lightweight preview and registers the dataset on the server.
Small example of a flow (shortened version):
{
  "query": "Which football team has a build-up play speed of 31...",
  "tool_calls": [
    {"name": "get_data", "arguments": {"tool_universe_id": "..."}},
    {"name": "select_data_equal_to", "arguments": {"key_name": "play_speed", "value": 31}},
    {"name": "get_team_name", "arguments": {"n": 1}}
  ],
  "answer": "FC Barcelona"
}
The four capabilities VAKRA measures
BI API (SLOT-BIRD and SEL-BIRD) - 2,077 instances: sequencing 1–12 tool calls to derive answers from tabular datasets.
Tool Selection (REST-BIRD) - 1,597 instances: choosing the correct API from a large set of endpoints.
MultiHop API (REST-BIRD) - 869 instances: multi-hop reasoning over APIs (1–5 logical hops).
MultiHop MultiSource + Policies (REST-BIRD + documents) - 644 instances: mixing APIs and document retrieval, multi-turn dialogs and tool-use policies.
Capability 4 is the most complex: for each hop the required source is specified (API or document retrieval). To avoid leakage, each source is decontaminated: the information needed for a hop is available only in the designated source.
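To make the per-hop source assignment concrete, a capability-4 instance could be sketched like this. This is a hypothetical shape for illustration only, not VAKRA's actual schema; the field names and the policy string are mine:

```python
# Hypothetical capability-4 instance: each hop names the single source
# (API or document retrieval) where its information lives, and the
# instance carries the tool-use policies the agent must respect.
instance = {
    "query": "Which supplier of the product in ticket 42 updated its return policy?",
    "hops": [
        {"step": "look up ticket 42",                  "source": "api"},
        {"step": "resolve the product's supplier",     "source": "api"},
        {"step": "fetch the supplier's return policy", "source": "retrieval"},
    ],
    "policies": ["answer only from the designated sources"],
}

# Decontamination means the third hop is answerable only via document
# retrieval: the APIs deliberately do not contain that information.
sources = [hop["source"] for hop in instance["hops"]]
```

The point of the per-hop `source` field is that an agent cannot compensate for a weak retriever with API calls, or vice versa.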
Executable evaluation: how an agent is judged
VAKRA doesn't stop at comparing final answers. Its evaluator uses two inputs per sample: the predicted final answer and the trajectory of tool calls. The pipeline is cascading:
For capability 4, policy compliance is checked first.
The predicted tool sequence is compared against the ground truth, but with flexibility: alternative sequences are allowed if the retrieved outputs cover the required information.
If the trajectory is valid, the final answer is judged by an LLM grader that verifies it is grounded in the tools’ outputs.
To handle structural equivalences (order, aggregation, format), a programmatic check runs first. If ambiguity remains, an LLM-based evaluation adapted from the CRAG framework decides whether the predicted trajectory retrieved all necessary information.
This means an agent can score points if it uses routes different from the dataset author’s, as long as it retrieves the same information.
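The cascade above can be sketched as a small scoring function. The three helper callables (check_policy, trajectory_covers, llm_grader) are hypothetical stand-ins for VAKRA's real checks, injected as parameters so the sketch stays self-contained:

```python
def judge(sample, predicted_calls, predicted_answer,
          check_policy, trajectory_covers, llm_grader):
    """Sketch of a cascading evaluator in VAKRA's style.

    sample: dict with at least "capability", "policies", "gold_calls".
    The helper callables are assumptions, not VAKRA's actual API.
    """
    # Step 1 (capability 4 only): policy compliance is checked first.
    if sample["capability"] == 4 and not check_policy(predicted_calls,
                                                      sample["policies"]):
        return 0.0
    # Step 2: the trajectory passes if its retrieved outputs cover the
    # required information, even via an alternative call sequence.
    if not trajectory_covers(predicted_calls, sample["gold_calls"]):
        return 0.0
    # Step 3: only then is the final answer graded for groundedness.
    return 1.0 if llm_grader(predicted_answer, predicted_calls) else 0.0
```

The ordering matters: a policy violation short-circuits everything, so an answer can never earn credit by breaking the rules, and an ungrounded answer can never earn credit on a broken trajectory.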
Metrics and scoring
Each capability carries equal weight for the final score.
Within capabilities 1–3, each sample is worth the same.
In capability 4, heterogeneous queries are weighted higher because of their complexity.
The evaluation considers both final-answer accuracy and the completeness and validity of the tool trajectory.
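Under those rules, aggregation is a weighted mean within each capability followed by an unweighted mean across the four capabilities. A minimal sketch, with function names of my own choosing:

```python
def capability_score(sample_scores, weights=None):
    """Mean score within one capability.

    Capabilities 1-3: no weights, every sample counts equally.
    Capability 4: pass per-sample weights (heterogeneous queries higher).
    """
    if weights is None:
        weights = [1.0] * len(sample_scores)
    return sum(s * w for s, w in zip(sample_scores, weights)) / sum(weights)

def vakra_score(cap_scores):
    """Final score: each of the four capabilities weighs the same."""
    return sum(cap_scores) / len(cap_scores)
```

One consequence of equal capability weights: capability 4's 644 instances move the final score as much as capability 1's 2,077, so the hardest setting cannot be diluted by volume.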
Main failure modes (technical analysis)
VAKRA classifies failures by the first breaking point in execution, following this order:
Incorrect tool selection.
Missing or hallucinated required arguments.
Incorrect values in arguments.
Final answer not precise or not grounded in tool outputs.
By assigning each sample to the first detected failure, double counting is avoided and you get a disjoint error distribution. That makes it easier to interpret where agents break in the reasoning chain.
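First-failure attribution is straightforward to reproduce. A minimal sketch, with hypothetical bucket names of my own:

```python
# Fixed priority order, mirroring the four failure modes above.
FAILURE_ORDER = [
    "wrong_tool",                  # incorrect tool selection
    "missing_or_hallucinated_arg", # required argument absent or invented
    "wrong_arg_value",             # argument present but value incorrect
    "ungrounded_answer",           # answer imprecise or not tool-grounded
]

def first_failure(checks):
    """checks maps failure name -> bool (True means the check failed).

    Returns the first failure in the fixed order, so every sample lands
    in exactly one bucket and the error distribution stays disjoint.
    """
    for failure in FAILURE_ORDER:
        if checks.get(failure):
            return failure
    return None  # trajectory and answer both correct
```

Because the order is fixed, a sample that both picks the wrong tool and hallucinates an argument is counted only once, under the earlier breakpoint.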
Results and lessons from the evaluated models
Overall performance: models still score low on VAKRA. Picking APIs in isolation isn't enough for a robust deployment.
BI API (SLOT vs SEL): GPT-OSS-120B leads here due to better handling of schemas and parameters. In SLOT-BIRD (few tools, many parameters) other models often fail on argument values. In SEL-BIRD (many tools, fewer parameters) tool-selection errors increase.
Tool Selection (REST-BIRD): Gemini-3-flash-preview performs better at selecting the right tools.
Multi-hop: accuracy drops with hop depth; 1-hop is much easier than 2-hop and 3+ hops.
Multi-source and RAG: adding documents complicates everything. Some models, like GPT-OSS-120B, sometimes skip the retriever call in 1-hop RAG and answer from parametric memory—especially on Wikipedia-entity-style questions.
Policies: imposing tool-use restrictions reduces performance. Many models violate policies or fail to retrieve enough information while complying. Only Granite-4.0-h-Small-32B shows less degradation on certain policy types.
Figures and charts in the original blog detail error distributions by collection and interaction type (API, RAG, hybrid), and how hop depth and policies affect accuracy.
What does this mean for agent builders?
Domain and schema matter as much as LLM capability. A good understanding of tool schemas reduces argument errors.
Design a tool-shortlisting step when API limits (e.g., the 128-tool cap in OpenAI's spec) force it.
Usage policies are strict in production. VAKRA-style tests help you discover whether your agent can follow operational constraints without losing accuracy.
VAKRA’s executable evaluator is a practical recipe to validate end-to-end agents: it lets you verify that alternative routes are valid and that answers are grounded.
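On the shortlisting point, a first cut can be as simple as lexical ranking before graduating to embeddings. A naive sketch under my own assumptions; none of these names come from VAKRA:

```python
def shortlist_tools(query_terms, tools, limit=128):
    """Keep the `limit` tools whose name/description best match the query.

    Purely lexical scoring for illustration; a production system would
    likely rank with embeddings instead.
    """
    def score(tool):
        text = (tool["name"] + " " + tool["description"]).lower()
        return sum(term.lower() in text for term in query_terms)
    # Stable sort: ties keep their original registry order.
    ranked = sorted(tools, key=score, reverse=True)
    return ranked[:limit]
```

Even a crude filter like this keeps a 328-tool REST-BIRD domain under a 128-tool cap, turning an impossible prompt into a merely hard one.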
If you think your agent is production-ready, this is the place to prove it with executable evidence. VAKRA doesn't forgive shortcuts: it shows whether you fail at tool selection, multi-hop reasoning, or policy compliance.