EVA: a new framework for evaluating voice AI agents
Voice conversational agents aren't just text turned into audio or speech recognition that returns a transcript. They're systems that need to complete tasks correctly and, at the same time, communicate like a good human on the phone. EVA was created to evaluate both aspects together, end-to-end.
What is EVA
EVA is an end-to-end evaluation framework for conversational voice agents that measures multi-turn spoken conversations using a realistic bot-to-bot architecture. It produces two high-level scores: EVA-A (Accuracy) and EVA-X (Experience), and adds diagnostic metrics to explain why an agent fails.
You might ask: why is this needed? Because many current benchmarks analyze only isolated components: STT, TTS, conversational dynamics, or task completion, each measured separately. That leaves the full interaction unmeasured, even though in practice a single mis-transcribed character or an overly long reply can make the whole exchange fail.
Architecture and main components
EVA runs full spoken conversations between a user simulator and the evaluated agent, with a deterministic tool executor and automatic validators. The five components are:
User Simulator: a conversational AI with a defined goal and persona that speaks using high-quality TTS to recreate natural turn-taking and speech variation.
Voice Agent: the system under test. EVA supports cascade architectures (STT -> LLM -> TTS) and audio-native (S2S or S2T -> TTS) using Pipecat for real-time voice applications.
Tool Executor: deterministic Python functions that answer queries and modify the scenario database.
Validators: automatic metrics that verify the conversation reached the expected state; if validation fails, the conversation is regenerated.
Metrics Suite: uses recordings, transcripts, and tool-call logs to compute scores.
Each test is a reproducible record with: the user's goal, persona, scenario database, and ground truth of the expected final state.
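A test record of this shape can be sketched in plain Python. The field names, the dictionary-based scenario database, and the `task_completed` check below are illustrative assumptions, not EVA's actual schema:

```python
from dataclasses import dataclass


@dataclass
class TestRecord:
    """One reproducible EVA-style test case (field names are illustrative)."""
    goal: str          # what the simulated user wants to achieve
    persona: str       # speaking style / character of the user simulator
    scenario_db: dict  # initial state the tool executor operates on
    ground_truth: dict # expected final state after a successful run


def task_completed(final_db: dict, ground_truth: dict) -> bool:
    """Deterministic check: every expected key/value must hold in the final DB."""
    return all(final_db.get(k) == v for k, v in ground_truth.items())


record = TestRecord(
    goal="Rebook the cancelled flight while keeping the seat selection",
    persona="hurried business traveler",
    scenario_db={"booking_AC123": {"status": "cancelled", "seat": "14C"}},
    ground_truth={"booking_AC123": {"status": "rebooked", "seat": "14C"}},
)

final_state = {"booking_AC123": {"status": "rebooked", "seat": "14C"}}
print(task_completed(final_state, record.ground_truth))  # True
```

Because the scenario database and ground truth travel with the record, any run of the same test is directly comparable.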
Metrics: EVA-A, EVA-X, and diagnostic metrics
EVA measures accuracy and experience across three subdimensions each, and also reports diagnostic metrics:
EVA-A (Accuracy)
Task Completion [deterministic]: compares the final state of the database against the ground truth.
Faithfulness [LLM-as-Judge]: detects fabrications, policy violations, and unsupported answers.
Agent Speech Fidelity [LALM-as-Judge]: evaluates at the audio level whether the agent pronounced critical entities correctly (codes, flight numbers, amounts).
EVA-X (Experience)
Conciseness [LLM-as-Judge]: whether responses are appropriately brief for spoken delivery.
Conversation Progression [LLM-as-Judge]: whether the conversation advances, retains context, and avoids getting stuck.
Turn-Taking [LLM-as-Judge]: whether the agent interrupts or leaves excessive silence.
Additional diagnostics isolate failure modes (ASR, synthesis, entity management, latency). EVA reports pass@k (the probability that at least one of k runs succeeds) and pass^k (the probability that all k runs succeed), with k = 3 by default, to capture both peak performance and consistency.
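With k runs per scenario, both reliability metrics reduce to simple aggregations over boolean outcomes. The scenario names and run results below are made up for illustration:

```python
def pass_at_k(outcomes: list[bool]) -> bool:
    """pass@k: at least one of the k runs succeeded."""
    return any(outcomes)


def pass_hat_k(outcomes: list[bool]) -> bool:
    """pass^k: all k runs succeeded."""
    return all(outcomes)


# Outcomes of k = 3 independent runs per scenario (hypothetical data).
runs = {
    "rebooking_irrops": [True, True, True],
    "voucher_refund":   [True, False, True],
    "standby_request":  [False, False, False],
}

n = len(runs)
rate_at_3 = sum(pass_at_k(o) for o in runs.values()) / n
rate_hat_3 = sum(pass_hat_k(o) for o in runs.values()) / n
print(f"pass@3 = {rate_at_3:.2f}, pass^3 = {rate_hat_3:.2f}")
# pass@3 = 0.67, pass^3 = 0.33
```

The gap between the two rates is exactly the inconsistency the paper highlights: two of three scenarios succeed at least once, but only one succeeds every time.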
Evaluation methodology
EVA combines deterministic metrics (fast and reproducible) with LLM or LALM-based judges for qualitative aspects. Each judge chosen is the one that performs best on a curated dataset for that metric. Conversations run in real audio to surface latency issues, turn-taking errors, and mistakes reproducing entities.
One key point: conversations that fail automatic validation are regenerated before analysis, avoiding costly human labeling later to filter out corrupted simulator runs.
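The validate-and-regenerate step can be sketched as a retry loop. Both `run_conversation` and `validate` are stand-ins here, not EVA's actual APIs:

```python
import random


def run_conversation(seed: int) -> dict:
    """Stand-in for a full simulated bot-to-bot conversation (hypothetical)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return {"final_state_ok": rng.random() > 0.3, "transcript": "..."}


def validate(conversation: dict) -> bool:
    """Stand-in validator: did the conversation reach the expected state?"""
    return conversation["final_state_ok"]


def generate_valid_conversation(max_attempts: int = 5):
    """Regenerate until the validators pass, before any metric is computed."""
    for attempt in range(max_attempts):
        convo = run_conversation(seed=attempt)
        if validate(convo):
            return convo
    return None  # give up: flag the scenario for manual review


convo = generate_valid_conversation()
print(convo is not None)
```

Filtering at generation time means every conversation that reaches the metrics suite is already known to be well-formed, so no human pass is needed to weed out corrupted simulator runs.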
Key results and technical findings
Twenty systems were evaluated (proprietary and open-source; cascade and audio-native) on an initial dataset of 50 scenarios in the aviation domain: rebooking after irregular operations (IRROPS), cancellations, vouchers, and standby.
Main findings:
Accuracy-experience tradeoff: there is a consistent tension. Systems that achieve a high task-completion rate tend to deliver a worse conversational experience, and vice versa. No configuration dominated both axes.
Named entities: transcribing names and codes is a dominant failure mode. A single wrong character can invalidate an authentication step and break the conversation.
Multi-step flows: rebookings that must preserve ancillaries (seats, baggage) are the most frequent point of failure.
Consistency: the gap between pass@3 and pass^3 is large. Many agents complete the task occasionally but not consistently, which is critical in production.
These results show why evaluating only task completion is insufficient for real deployments.
Limitations and roadmap
EVA is a step forward, but it has limits the authors acknowledge:
Framework: the user simulator relies on a commercial TTS provider, which can bias results toward certain ASR systems. Full replication requires access to commercial APIs, and measured latency varies with infrastructure.
Data: the initial release covers 50 scenarios in English and a single domain. There's not yet broad coverage of accents, languages, or extreme behaviors.
Metrics: LLM judges can introduce biases and stylistic affinities. Also, measuring task completion as binary doesn't capture partial credit.
Next steps announced:
Add prosodic evaluation (pronunciation, rhythm, expressiveness) and improve LALM-human alignment.
Robustness under noise, accent diversity, and multilingual users.
New domains and longer scenarios with extended conversational memory.
Error analysis tools and a continuously updated leaderboard.
How to use EVA and where to find the code
EVA is released with the initial dataset and judge prompts. The code and data are publicly available on GitHub, ready for researchers and product teams to replicate tests, extend scenarios, and compare cascade vs audio-native setups.
If you work with voice agents, try measuring EVA-A and EVA-X in parallel. Does your agent complete tasks but frustrate users on the line? Then you have a classic tradeoff that requires adjustments in conversational design, confidence calibration, and improved ASR robustness for entities.
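As a rough way to put both axes side by side, one could average each axis's three subdimension scores. This simple mean is an assumption for illustration; the source does not specify EVA's aggregation:

```python
def axis_score(subscores: dict) -> float:
    """Average a set of subdimension scores into one axis (assumed aggregation)."""
    return sum(subscores.values()) / len(subscores)


# Hypothetical subdimension scores on a 0-1 scale.
eva_a = axis_score({"task_completion": 0.80, "faithfulness": 0.90, "speech_fidelity": 0.70})
eva_x = axis_score({"conciseness": 0.55, "progression": 0.65, "turn_taking": 0.60})

# A high EVA-A with a low EVA-X is the classic accuracy-experience tradeoff.
print(f"EVA-A = {eva_a:.2f}, EVA-X = {eva_x:.2f}")
# EVA-A = 0.80, EVA-X = 0.60
```

Tracking the two numbers as a pair, rather than collapsing them into one, is what makes the tradeoff visible run over run.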
Team and acknowledgments
Contributors include Tara Bogavelli, Gabrielle Gauthier Melancon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, and Hari Subramani, among others. The project builds on prior work from ServiceNow's PAVA and CLAE teams.