AiPhreaks

A New Framework for Evaluating Voice Agents (EVA)

By Jakub Antkiewicz

March 24, 2026

Researchers at ServiceNow-AI have released EVA, a new open-source framework for evaluating conversational voice agents. The framework introduces a method for jointly measuring both an agent's task accuracy and the quality of its conversational experience, two objectives that are often at odds. This approach addresses a significant gap in testing, as existing benchmarks typically assess these components in isolation, failing to capture how they interact and trade off against each other in realistic, multi-turn spoken dialogues.

EVA operates using a bot-to-bot architecture that simulates live audio conversations, where a user simulator with a specific goal interacts with the voice agent being tested. The framework produces two primary scores: EVA-A for Accuracy, which measures task completion, faithfulness to provided information, and the fidelity of the agent's spoken audio; and EVA-X for Experience, which assesses the conciseness, progression, and turn-taking dynamics of the conversation. The initial release includes a dataset of 50 airline industry scenarios, such as flight rebooking and cancellations, and provides benchmark results for 20 different cascade and audio-native systems.
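To make the evaluation loop concrete, here is a minimal sketch of a bot-to-bot setup in the style the article describes: a goal-driven user simulator converses with an agent, and the resulting transcript is scored along an accuracy axis and an experience axis. All names and heuristics here (`run_dialogue`, `eva_a`, `eva_x`, the toy scoring rules) are illustrative assumptions, not the actual EVA API.

```python
# Toy bot-to-bot evaluation loop: a user simulator with a goal talks to a
# stand-in agent, then the transcript is scored on accuracy and experience.
# Everything below is an illustrative sketch, NOT the real EVA framework.
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "user" or "agent"
    text: str      # transcript of the spoken turn

def run_dialogue(user_goal: str, max_turns: int = 6) -> list[Turn]:
    """Simulate a goal-driven conversation between a user bot and an agent bot."""
    turns: list[Turn] = []
    user_utterance = f"Hi, I need help with: {user_goal}"
    for _ in range(max_turns):
        turns.append(Turn("user", user_utterance))
        # Stand-in for a real voice agent's reply (the system under test).
        agent_reply = f"Working on '{user_goal}' now."
        turns.append(Turn("agent", agent_reply))
        user_utterance = "Thanks, please continue."
    return turns

def eva_a(turns: list[Turn], goal: str) -> float:
    """Accuracy-style score: did the agent's turns address the task? (toy heuristic)"""
    agent_text = " ".join(t.text for t in turns if t.speaker == "agent")
    return 1.0 if goal in agent_text else 0.0

def eva_x(turns: list[Turn]) -> float:
    """Experience-style score: penalize long, dragging conversations. (toy heuristic)"""
    return max(0.0, 1.0 - 0.05 * len(turns))

dialogue = run_dialogue("rebook flight UA123")
print(eva_a(dialogue, "rebook flight UA123"), round(eva_x(dialogue), 2))  # → 1.0 0.4
```

The two scores pull in different directions by construction: padding the dialogue with extra confirmation turns can raise task accuracy while dragging down the experience score, which mirrors the accuracy/experience tradeoff the study reports.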

The most prominent finding from the initial study is a consistent tradeoff between accuracy and experience: systems that excelled at completing tasks often provided a poor conversational experience, and vice versa. This suggests that a narrow focus on task success can lead to agents that are functionally correct but frustrating for users to interact with. The results also highlight that correctly transcribing named entities remains a primary point of failure, and that agents frequently struggle with multi-step workflows and run-to-run consistency, indicating substantial refinement is needed before these systems are ready for production use.

The documented tradeoff between accuracy and experience in voice agents means success is not just about completing a task, but about the quality of the entire interaction. Enterprises must now treat conversational design as a first-order metric, on par with functional correctness, to avoid deploying agents that are effective but unusable.