EVA: a new framework for evaluating voice AI agents
Voice conversational agents aren't just text turned into audio or speech recognition that returns a transcript. They're systems that need to complete tasks correctly and, at the same time, communicate like a good human on the phone. EVA was created to evaluate both aspects together, end-to-end.
What is EVA
EVA is an end-to-end evaluation framework for conversational voice agents that measures multi-turn spoken conversations using a realistic bot-to-bot architecture. It produces two high-level scores: EVA-A (Accuracy) and EVA-X (Experience), and adds diagnostic metrics to explain why an agent fails.
You might ask: why is this needed? Because many current benchmarks analyze only isolated components: STT, TTS, conversational dynamics, or task completion, each measured separately. That leaves the full interaction unmeasured, even though in practice a single mis-transcribed character or an overly long reply can make the whole exchange fail.
Architecture and main components
EVA runs full spoken conversations between a user simulator and the evaluated agent, with a deterministic tool executor and automatic validators. The five components are:
User Simulator: a conversational AI with a defined goal and persona that speaks using high-quality TTS to recreate natural turn-taking and speech variation.
Voice Agent: the system under test. EVA supports cascade architectures (STT -> LLM -> TTS) and audio-native (S2S or S2T -> TTS) using Pipecat for real-time voice applications.
Tool Executor: deterministic Python functions that answer queries and modify the scenario database.
Validators: automatic metrics that verify the conversation reached the expected state; if validation fails, the conversation is regenerated.
Metrics Suite: uses recordings, transcripts, and tool-call logs to compute scores.
Each test is a reproducible record with: the user's goal, persona, scenario database, and ground truth of the expected final state.
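A test record of this shape can be sketched in plain Python. The field names, the dictionary-based scenario database, and the `task_completed` check below are illustrative assumptions, not EVA's actual schema:

```python
from dataclasses import dataclass


@dataclass
class TestRecord:
    """One reproducible EVA-style test case (field names are illustrative)."""
    goal: str          # what the simulated user wants to achieve
    persona: str       # speaking style / character of the user simulator
    scenario_db: dict  # initial state the tool executor operates on
    ground_truth: dict # expected final state after a successful run


def task_completed(final_db: dict, ground_truth: dict) -> bool:
    """Deterministic check: every expected key/value must hold in the final DB."""
    return all(final_db.get(k) == v for k, v in ground_truth.items())


record = TestRecord(
    goal="Rebook the cancelled flight while keeping the seat selection",
    persona="hurried business traveler",
    scenario_db={"booking_AC123": {"status": "cancelled", "seat": "14C"}},
    ground_truth={"booking_AC123": {"status": "rebooked", "seat": "14C"}},
)

final_state = {"booking_AC123": {"status": "rebooked", "seat": "14C"}}
print(task_completed(final_state, record.ground_truth))  # True
```

Because the scenario database and ground truth travel with the record, any run of the same test is directly comparable.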
Metrics: EVA-A, EVA-X, and diagnostic metrics
EVA measures accuracy and experience across three subdimensions each, and also reports diagnostic metrics:
EVA-A (Accuracy)
Task Completion [deterministic]: compares the final state of the database against the ground truth.
Faithfulness [LLM-as-Judge]: detects fabrications, policy violations, and unsupported answers.
Agent Speech Fidelity [LALM-as-Judge]: evaluates at the audio level whether the agent pronounced critical entities correctly (codes, flight numbers, amounts).
EVA-X (Experience)
Conciseness [LLM-as-Judge]: whether responses are appropriately brief for spoken delivery.
Conversation Progression [LLM-as-Judge]: whether the conversation advances, retains context, and avoids getting stuck.
Turn-Taking [LLM-as-Judge]: whether the agent interrupts or leaves excessive silence.
Additional diagnostics isolate failure modes (ASR, synthesis, entity management, latency). EVA reports pass@k (the probability that at least one of k runs succeeds) and pass^k (the probability that all k runs succeed), with k = 3 by default, to capture both peak performance and consistency.
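With k runs per scenario, both reliability metrics reduce to simple aggregations over boolean outcomes. The scenario names and run results below are made up for illustration:

```python
def pass_at_k(outcomes: list[bool]) -> bool:
    """pass@k: at least one of the k runs succeeded."""
    return any(outcomes)


def pass_hat_k(outcomes: list[bool]) -> bool:
    """pass^k: all k runs succeeded."""
    return all(outcomes)


# Outcomes of k = 3 independent runs per scenario (hypothetical data).
runs = {
    "rebooking_irrops": [True, True, True],
    "voucher_refund":   [True, False, True],
    "standby_request":  [False, False, False],
}

n = len(runs)
rate_at_3 = sum(pass_at_k(o) for o in runs.values()) / n
rate_hat_3 = sum(pass_hat_k(o) for o in runs.values()) / n
print(f"pass@3 = {rate_at_3:.2f}, pass^3 = {rate_hat_3:.2f}")
# pass@3 = 0.67, pass^3 = 0.33
```

The gap between the two rates is exactly the inconsistency the paper highlights: two of three scenarios succeed at least once, but only one succeeds every time.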
Evaluation methodology
EVA combines deterministic metrics (fast and reproducible) with LLM or LALM-based judges for qualitative aspects. Each judge chosen is the one that performs best on a curated dataset for that metric. Conversations run in real audio to surface latency issues, turn-taking errors, and mistakes reproducing entities.
One key point: conversations that fail automatic validation are regenerated before analysis, avoiding costly human labeling later to filter out corrupted simulator runs.
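The validate-and-regenerate step can be sketched as a retry loop. Both `run_conversation` and `validate` are stand-ins here, not EVA's actual APIs:

```python
import random


def run_conversation(seed: int) -> dict:
    """Stand-in for a full simulated bot-to-bot conversation (hypothetical)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return {"final_state_ok": rng.random() > 0.3, "transcript": "..."}


def validate(conversation: dict) -> bool:
    """Stand-in validator: did the conversation reach the expected state?"""
    return conversation["final_state_ok"]


def generate_valid_conversation(max_attempts: int = 5):
    """Regenerate until the validators pass, before any metric is computed."""
    for attempt in range(max_attempts):
        convo = run_conversation(seed=attempt)
        if validate(convo):
            return convo
    return None  # give up: flag the scenario for manual review


convo = generate_valid_conversation()
print(convo is not None)
```

Filtering at generation time means every conversation that reaches the metrics suite is already known to be well-formed, so no human pass is needed to weed out corrupted simulator runs.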
Key results and technical findings
Twenty systems were evaluated (proprietary and open-source; cascade and audio-native) on an initial dataset of 50 scenarios in the aviation domain: rebooking after irregular operations (IRROPS), cancellations, vouchers, and standby.
Main findings:
Accuracy-experience tradeoff: there is a consistent tension. Systems that achieve a high task-completion rate tend to deliver a worse conversational experience, and vice versa. No configuration dominated both axes.
Named entities: transcribing names and codes is a dominant failure mode. A single wrong character can invalidate an authentication step and break the conversation.
Multi-step flows: rebookings that must preserve ancillaries (seats, baggage) are the most frequent point of failure.
Consistency: the gap between pass@3 and pass^3 is large. Many agents complete the task occasionally but not consistently, which is critical in production.
These results show why evaluating only task completion is insufficient for real deployments.
Limitations and roadmap
EVA is a step forward, but it has limits the authors acknowledge:
Framework: the user simulator relies on a commercial TTS provider, which can bias results toward certain ASR systems. Full replication requires access to commercial APIs, and measured latency varies with infrastructure.
Data: the initial release covers 50 scenarios in English and a single domain. There's not yet broad coverage of accents, languages, or extreme behaviors.
Metrics: LLM judges can introduce biases and stylistic affinities. Also, measuring task completion as binary doesn't capture partial credit.
Next steps announced:
Add prosodic evaluation (pronunciation, rhythm, expressiveness) and improve LALM-human alignment.
Robustness under noise, accent diversity, and multilingual users.
New domains and longer scenarios with extended conversational memory.
Error analysis tools and a continuously updated leaderboard.
How to use EVA and where to find the code
EVA is released with the initial dataset and judge prompts. The code and data are publicly available on GitHub, ready for researchers and product teams to replicate tests, extend scenarios, and compare cascade vs audio-native setups.
If you work with voice agents, try measuring EVA-A and EVA-X in parallel. Does your agent complete tasks but frustrate users on the line? Then you have a classic tradeoff that requires adjustments in conversational design, confidence calibration, and improved ASR robustness for entities.
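As a rough way to put both axes side by side, one could average each axis's three subdimension scores. This simple mean is an assumption for illustration; the source does not specify EVA's aggregation:

```python
def axis_score(subscores: dict) -> float:
    """Average a set of subdimension scores into one axis (assumed aggregation)."""
    return sum(subscores.values()) / len(subscores)


# Hypothetical subdimension scores on a 0-1 scale.
eva_a = axis_score({"task_completion": 0.80, "faithfulness": 0.90, "speech_fidelity": 0.70})
eva_x = axis_score({"conciseness": 0.55, "progression": 0.65, "turn_taking": 0.60})

# A high EVA-A with a low EVA-X is the classic accuracy-experience tradeoff.
print(f"EVA-A = {eva_a:.2f}, EVA-X = {eva_x:.2f}")
# EVA-A = 0.80, EVA-X = 0.60
```

Tracking the two numbers as a pair, rather than collapsing them into one, is what makes the tradeoff visible run over run.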
Team and acknowledgments
Contributors include Tara Bogavelli, Gabrielle Gauthier Melancon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, and Hari Subramani, among others. The project builds on prior work from ServiceNow's PAVA and CLAE teams.