EVA-Bench Data 2.0 expands voice agents benchmark | Keryc
EVA-Bench Data 2.0 arrives to bring more rigor and realism to how we evaluate enterprise voice agents. Why does it matter? Because an agent that handles confirmation codes for an airline can fail spectacularly when asked to process HR policies in healthcare. This version moves from one domain to three, with 213 scenarios covering 121 tools: roughly a 4x jump in coverage compared to the original.
What the new release includes
The three domains are clear and complementary: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM) and Healthcare HR Service Delivery (HRSD). In total there are 213 validated scenarios run against frontier models: OpenAI GPT-5.4, Google Gemini 3.1 Pro and Anthropic Claude Opus 4.6.
Airline (CSM): 50 scenarios.
ITSM: 80 scenarios.
HRSD (Healthcare): 83 scenarios.
Each scenario comes with a structured user objective, an initial state database and the expected final state. The whole package is open source under MIT and ready to download from Hugging Face.
Example code to load the datasets
You can load the sets with the datasets library like this:
```python
from datasets import load_dataset

# Airline Customer Service Management (CSM) — 50 scenarios
airline = load_dataset("ServiceNow-AI/eva-bench", "airline", split="test")

# Enterprise IT Service Management (ITSM) — 80 scenarios
itsm = load_dataset("ServiceNow-AI/eva-bench", "itsm", split="test")

# Healthcare HR Service Delivery (HRSD) — 83 scenarios
hrsd = load_dataset("ServiceNow-AI/eva-bench", "medical", split="test")
```
Each record contains everything needed for a reproducible bot-vs-bot evaluation: user_goal, initial_scenario_database and expected_final_state.
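To make the role of these three fields concrete, here is a minimal sketch of a state-based check: after the agent finishes, its final database is compared with expected_final_state. The table and row layout, the contents of user_goal and the helper name diff_states are assumptions for illustration only; the actual EVA-Bench schema and scorer may differ.

```python
from typing import Any

# Hypothetical layout: {table_name: {row_id: {field: value}}}.
State = dict[str, dict[str, dict[str, Any]]]


def diff_states(actual: State, expected: State) -> list[str]:
    """Return human-readable differences between two database states."""
    problems: list[str] = []
    for table in sorted(set(actual) | set(expected)):
        actual_rows = actual.get(table, {})
        expected_rows = expected.get(table, {})
        for row_id in sorted(set(actual_rows) | set(expected_rows)):
            if row_id not in actual_rows:
                problems.append(f"{table}/{row_id}: missing row")
            elif row_id not in expected_rows:
                problems.append(f"{table}/{row_id}: unexpected row")
            elif actual_rows[row_id] != expected_rows[row_id]:
                problems.append(f"{table}/{row_id}: field mismatch")
    return problems


# Illustrative record using the three field names from the dataset.
record = {
    "user_goal": {"summary": "Move booking B1 to the next available flight"},
    "initial_scenario_database": {"bookings": {"B1": {"flight": "F100"}}},
    "expected_final_state": {"bookings": {"B1": {"flight": "F200"}}},
}

# Pretend the agent ran and left the database in this state.
agent_final_state = {"bookings": {"B1": {"flight": "F200"}}}

# The task counts as completed only when no differences remain.
task_completed = not diff_states(agent_final_state, record["expected_final_state"])
```

Comparing end states rather than transcripts is what makes a run reproducible: two conversations worded differently still get the same verdict if they leave the database in the same state.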
Design principles (technical and practical)
The authors followed five clear principles worth understanding if you work on voice agents:
Voice-first scope. Only flows that truly happen over the phone are included in the benchmark. That avoids noise and keeps the evaluation relevant.
Realism. Tool schemas and policies are modeled after real APIs and regulations. The HRSD domain, for example, references NPI (National Provider Identifier) numbers, FMLA (Family and Medical Leave Act) leave and other inputs from US healthcare administration.
Variety. The set mixes single-intent scenarios, multi-intent scenarios (up to 4 intents) and adversarial cases in which the user tries to skip steps or access things without authorization.
Authentication. Authentication flows are included and calibrated to context; for example, step-up verification with a one-time password (OTP) is required only where it makes sense.
Reproducibility. Each scenario has exactly one correct resolution path to avoid ambiguous evaluation signals.
Joint generation and validation: SyGra + LLMs
Here’s the most technical and novel part: EVA-Bench uses SyGra, a graph-based pipeline, with GPT-5.4 as the generation engine. Each scenario needs three components generated together to avoid inconsistencies:
User goal structured as a decision tree so the simulator is deterministic.
Initial scenario database with all entities the scenario will reference.
Expected final database state obtained by running the LLM over the agent instructions and the scenario, leaving a trace of actions and the terminal state.
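The first component, a user goal structured as a decision tree, could look roughly like the sketch below. The GoalNode class, the response labels and the sample utterances are all hypothetical; the point is only that a simulator walking such a tree says the same thing every time it sees the same sequence of agent behaviors.

```python
from dataclasses import dataclass, field


@dataclass
class GoalNode:
    """One step of a hypothetical user-goal decision tree."""

    utterance: str
    # Maps a label for the agent's behavior to the user's next step.
    branches: dict[str, "GoalNode"] = field(default_factory=dict)


def simulate(root: GoalNode, agent_responses: list[str]) -> list[str]:
    """Walk the tree deterministically and return what the user says."""
    said = [root.utterance]
    node = root
    for response in agent_responses:
        if response not in node.branches:
            break  # Leaf reached, or the tree does not cover this response.
        node = node.branches[response]
        said.append(node.utterance)
    return said


# Illustrative ITSM-style goal; all content here is made up.
goal = GoalNode(
    "I need to reset my VPN password.",
    {
        "asks_for_employee_id": GoalNode(
            "My employee ID is E123.",
            {
                "offers_reset": GoalNode("Yes, please reset it."),
                "refuses": GoalNode("Okay, thank you. Goodbye."),
            },
        ),
    },
)

transcript = simulate(goal, ["asks_for_employee_id", "offers_reset"])
```

In this sketch the mapping from a free-form agent reply to a label such as asks_for_employee_id is left out; in a real simulator that classification step would be where an LLM or a rule set comes in.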
Joint generation prevents silent errors, such as a goal that references a case ID missing from the database. After generation, the authors run a validation loop with three steps:
Structural check against a Pydantic schema to catch type errors and missing fields.
LLM-based validator that checks consistency between the goal and the database, cross-references and authentication configuration.
LLM-based trace verification that ensures policy compliance, correct ordering of actions and absence of alternative paths that would create nondeterminism.
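The first step and part of the second can be approximated without any model call. The sketch below swaps the Pydantic schema for a plain-Python field check to stay dependency-free, and replaces the LLM-based cross-reference check with a simple deterministic one. The referenced_ids key and the database layout are assumptions made for the example, not the benchmark's actual schema.

```python
# Field names come from the dataset; the expected types are an assumption.
REQUIRED_FIELDS = {
    "user_goal": dict,
    "initial_scenario_database": dict,
    "expected_final_state": dict,
}


def structural_errors(record: dict) -> list[str]:
    """Stand-in for a schema check: missing fields and wrong types."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"wrong type for {name}")
    return errors


def dangling_references(record: dict) -> list[str]:
    """IDs the goal mentions that do not exist in the initial database."""
    known_ids = {
        row_id
        for rows in record["initial_scenario_database"].values()
        for row_id in rows
    }
    # `referenced_ids` is a hypothetical key used only in this sketch.
    referenced = record["user_goal"].get("referenced_ids", [])
    return [ref for ref in referenced if ref not in known_ids]


def validate(record: dict) -> list[str]:
    """Run the structural check first, then the cross-reference check."""
    errors = structural_errors(record)
    if errors:
        return errors  # Skip deeper checks on malformed records.
    return [f"dangling reference: {ref}" for ref in dangling_references(record)]


bad_record = {
    "user_goal": {"referenced_ids": ["INC002"]},
    "initial_scenario_database": {"incidents": {"INC001": {"status": "open"}}},
    "expected_final_state": {"incidents": {"INC001": {"status": "resolved"}}},
}
issues = validate(bad_record)
```

Running the cheap structural check before anything else means malformed records are rejected before any LLM validator is invoked on them.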
After automatic generation, every scenario went through manual reviews that corrected or discarded ambiguous records.
Evaluation with frontier models and dataset cleanup
As a final quality control, samples were executed in text-only mode (no audio) with GPT-5.4, Gemini 3.1 Pro and Claude Opus 4.6. If any model scored 0 on task completion, the researchers investigated whether it was due to:
A genuine model error.
Ambiguity in the policy.
A poorly specified user objective.
A bug in the tool executor.
Inconsistency between the initial and expected state.
Cases with dataset problems were corrected or removed. Crucially: every selected sample is solvable by at least one of the frontier models, which ensures the benchmark is challenging but fair.
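One way to read this cleanup rule is as a triage over per-model task-completion scores. The sketch below is an interpretation of the process described above, not the researchers' tooling; the function name, the labels it returns and the model keys are illustrative assumptions.

```python
def triage(scores: dict[str, float]) -> str:
    """Classify a sample from per-model task-completion scores.

    Returns one of three hypothetical labels:
      "keep"          - every model completed the task
      "review"        - at least one model scored 0, at least one did not
      "fix_or_remove" - no model completed the task, so the sample cannot
                        be selected as-is
    """
    if not scores:
        raise ValueError("at least one model score is required")
    solved = [score > 0 for score in scores.values()]
    if all(solved):
        return "keep"
    if any(solved):
        return "review"
    return "fix_or_remove"


# Illustrative scores for one sample; the numbers are made up.
sample_scores = {"gpt-5.4": 1.0, "gemini-3.1-pro": 0.0, "claude-opus-4.6": 1.0}
decision = triage(sample_scores)
```

A "review" outcome still needs a human to decide between the five causes listed above, since a zero can come from the model or from the dataset.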
Multilingualism and cultural adaptation
EVA-Bench expands beyond English. It's not just translating sentences: the adaptation includes names, phone numbers, email formats and locales. The evaluation pipeline is also adapted to respect cultural and linguistic variations that affect ASR and conversational interpretation.
Practical example: the same intent in English and French uses different locations, names and numbers so the simulator sounds authentic in each language. This lets you measure language-specific degradations in speech recognition and conversational reasoning.
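A rough sketch of that idea: keep the intent fixed and swap in a per-locale profile of names, places and phone formats. Every value below is invented for illustration (the phone numbers use ranges reserved for fiction), and the structure is an assumption rather than the format EVA-Bench uses.

```python
# Hypothetical per-locale entity profiles for the same caller intent.
LOCALE_PROFILES = {
    "en-US": {"name": "Emily Carter", "city": "Denver", "phone": "+1 415 555 0134"},
    "fr-FR": {"name": "Camille Martin", "city": "Lyon", "phone": "+33 1 99 00 12 34"},
}

# One opening line per locale, written natively instead of machine-translated.
TEMPLATES = {
    "en-US": "Hi, this is {name} calling from {city}. You can reach me at {phone}.",
    "fr-FR": "Bonjour, ici {name}, j'appelle de {city}. Vous pouvez me joindre au {phone}.",
}


def localize(locale: str) -> str:
    """Render the caller's opening line with locale-appropriate entities."""
    return TEMPLATES[locale].format(**LOCALE_PROFILES[locale])


english_opening = localize("en-US")
french_opening = localize("fr-FR")
```

Because the entities differ per locale, a speech recognizer is tested on names and number formats it would actually hear in that language.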
How this helps you today if you work with voice agents
If you evaluate a voice agent in production, run your system against these 213 scenarios to surface real failures in authentication, handling structured entities and multi-intent flows.
If you build your own dataset, the methodology section and the SyGra pipeline are a reference to generate reproducible, verifiable scenarios.
If you're interested in multilingual deployment, the proposed extension gives you a path to adapt both data and metrics.
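For the first use case, a small aggregation step helps show where failures cluster once you have per-scenario outcomes. The result fields used below (domain, category, passed) are assumptions about how you might log your own runs; they are not part of the dataset.

```python
from collections import defaultdict


def pass_rates(results: list[dict]) -> dict[tuple[str, str], float]:
    """Compute the pass rate for each (domain, category) pair."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    passed: dict[tuple[str, str], int] = defaultdict(int)
    for result in results:
        key = (result["domain"], result["category"])
        totals[key] += 1
        passed[key] += int(result["passed"])
    return {key: passed[key] / totals[key] for key in totals}


# Made-up outcomes from a hypothetical evaluation run.
results = [
    {"domain": "airline", "category": "adversarial", "passed": True},
    {"domain": "airline", "category": "adversarial", "passed": False},
    {"domain": "itsm", "category": "multi_intent", "passed": True},
]
rates = pass_rates(results)
```

Breaking results down this way makes it easier to see whether a weakness is specific to one domain or to one kind of scenario, such as adversarial or multi-intent flows.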
Quick technical recommendations
Prioritize tests on adversarial scenarios and those with unsatisfiable goals; that's where models fail most.
Use joint generation (goal + initial DB + final state) to avoid nondeterminism in automated evaluations.
Incorporate LLM validation and structural checks before accepting a record into your benchmark.
This release is not just more data. It's a replicable recipe to create voice benchmarks that are realistic, reproducible and multilingual. If your team develops or evaluates conversational agents, EVA-Bench Data 2.0 is a direct, practical technical resource to find critical failures before deployment.