AiPhreaks

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

By Jakub Antkiewicz

2026-06-05T11:28:46Z

ServiceNow-AI Releases Expanded EVA-Bench 2.0 Benchmark

ServiceNow-AI has released EVA-Bench Data 2.0, a significant expansion of its open-source benchmark for evaluating enterprise voice agents. The new version grows from one enterprise domain to three: Airline Customer Service (CSM), IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). The added domains are meant to capture the domain-specific nature of agent failures. The update increases the benchmark's scope by approximately 4x, to 213 distinct scenarios across 121 tools, giving developers and enterprises that deploy AI-driven voice systems a more challenging and realistic testbed.

Technical Design and Validation

The benchmark's scenarios were synthetically generated using ServiceNow-AI's SyGra pipeline with GPT-5.4 as its backbone. Each scenario was designed with five core principles in mind: voice-first scope, realism based on production APIs and policies, variety including adversarial and unsatisfiable user goals, domain-specific authentication, and strict reproducibility. To ensure fairness and solvability, every scenario was validated against three frontier models: OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6.

  • Domains: Airline Customer Service (CSM), Enterprise IT Service Management (ITSM), Healthcare HR Service Delivery (HRSD)
  • Total Scenarios: 213
  • Total Tools: 121
  • Scenario Types: Single-intent, multi-intent, and adversarial calls
  • Generation Backend: SyGra pipeline with GPT-5.4
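The validation step can be pictured with a minimal Python sketch. The article does not state the acceptance criterion, so the rule used here (a scenario is kept only if all three validator models resolve it) is an assumption, and the `Scenario` fields and the `is_solvable` and `filter_validated` helpers are hypothetical names, not part of EVA-Bench.

```python
from dataclasses import dataclass, field

# Validator models named in the article.
VALIDATOR_MODELS = ("GPT-5.4", "Gemini 3.1 Pro", "Claude Opus 4.6")


@dataclass
class Scenario:
    """Hypothetical scenario record; field names are illustrative only."""
    scenario_id: str
    domain: str  # "CSM", "ITSM", or "HRSD"
    kind: str    # "single-intent", "multi-intent", or "adversarial"
    # Maps a validator model name to whether it resolved the scenario.
    solved_by: dict = field(default_factory=dict)


def is_solvable(scenario: Scenario) -> bool:
    # Assumed rule: every validator model must have resolved the scenario.
    # A model with no recorded result counts as a failure.
    return all(scenario.solved_by.get(m, False) for m in VALIDATOR_MODELS)


def filter_validated(scenarios):
    """Return only the scenarios that pass the assumed validation rule."""
    return [s for s in scenarios if is_solvable(s)]
```

Under this assumed rule, a scenario that any one of the three models fails to resolve would be dropped or sent back for revision before release.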

Impact on AI Evaluation and Multilingual Support

By open-sourcing the datasets and evaluation framework on Hugging Face, EVA-Bench 2.0 provides the industry with a standardized tool for assessing agent performance in complex, real-world enterprise environments. The focus on reproducibility, with each scenario admitting exactly one correct resolution path, allows for more reliable comparisons between models. ServiceNow-AI also announced plans for a multilingual extension, which will adapt scenarios, user data, and evaluation metrics for non-English languages to inform global AI deployments.
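A single correct resolution path makes scoring simpler, because an agent's tool-call trace can be compared directly against one reference sequence. The sketch below assumes strict, ordered equality of tool names and arguments; the article does not describe the actual scoring rule, and the helper names and the example tool names (`lookup_booking`, `rebook_flight`) are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class ToolCall(NamedTuple):
    name: str
    args: tuple  # sorted (key, value) pairs, so keyword order is irrelevant


def call(name: str, **kwargs) -> ToolCall:
    """Build a ToolCall with arguments normalized into sorted pairs."""
    return ToolCall(name, tuple(sorted(kwargs.items())))


def follows_expected_path(expected: Sequence[ToolCall],
                          actual: Sequence[ToolCall]) -> bool:
    # Assumed rule: same calls, same arguments, same order.
    return list(expected) == list(actual)


def first_divergence(expected: Sequence[ToolCall],
                     actual: Sequence[ToolCall]) -> Optional[int]:
    """Return the index of the first mismatching step, or None if identical."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One trace is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None


# Hypothetical reference path for an airline rebooking scenario.
expected = [
    call("lookup_booking", confirmation="ABC123"),
    call("rebook_flight", confirmation="ABC123", flight="XY100"),
]
```

Reporting the first diverging step, rather than only pass or fail, shows where in the call an agent left the reference path.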

By mandating a single correct resolution path and generating user goals as deterministic decision trees, ServiceNow-AI's EVA-Bench 2.0 moves enterprise agent evaluation from ambiguous success rates to a more rigorous, engineering-grade benchmark focused on reproducibility and policy compliance.