What makes EVA-Bench 2.0 scenarios more realistic than other synthetic datasets?

The realism in EVA-Bench 2.0 comes from how its scenarios are built. Tool schemas are modeled after production APIs, and scenario policies are based on actual enterprise constraints; the HRSD domain, for example, draws on real US healthcare concepts such as NPI (National Provider Identifier) numbers and FMLA (Family and Medical Leave Act) leave. The benchmark also includes unsatisfiable scenarios and adversarial scenarios in which users attempt to bypass rules, reflecting the complexity of real customer service calls beyond simple 'happy path' interactions.

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

ServiceNow-AI Releases Expanded EVA-Bench 2.0 Benchmark

ServiceNow-AI has released EVA-Bench Data 2.0, a significant expansion of its open-source benchmark for evaluating enterprise voice agents. The new version grows from one enterprise domain to three: Airline Customer Service (CSM), IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). The broader coverage is meant to better address the domain-specific nature of agent failures. The update increases the benchmark's scope by approximately 4x, to 213 distinct scenarios across 121 tools, providing a more challenging and realistic testbed for developers and enterprises deploying AI-driven voice systems.

Technical Design and Validation

The benchmark's scenarios were synthetically generated using ServiceNow-AI's SyGra pipeline with GPT-5.4 as its backbone. Each scenario was designed with five core principles in mind: voice-first scope, realism based on production APIs and policies, variety including adversarial and unsatisfiable user goals, domain-specific authentication, and strict reproducibility. To ensure fairness and solvability, every scenario was validated against three frontier models: OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6.

Domains: Airline Customer Service (CSM), Enterprise IT Service Management (ITSM), Healthcare HR Service Delivery (HRSD)
Total Scenarios: 213
Total Tools: 121
Scenario Types: Single-intent, multi-intent, and adversarial calls
Generation Backend: SyGra pipeline with GPT-5.4
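
The article does not say how the three-model validation step decides whether a scenario passes. As a rough sketch of one plausible reading, the Python below keeps a scenario only if every validator model reaches the expected outcome. The model names come from the article; the `Scenario` fields, the `Runner` callable, and the "all validators must pass" rule are assumptions made for illustration and are not taken from the EVA-Bench codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Validator names are from the article; everything else below is hypothetical.
VALIDATORS = ["GPT-5.4", "Gemini 3.1 Pro", "Claude Opus 4.6"]


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    domain: str            # "CSM", "ITSM", or "HRSD"
    expected_outcome: str  # label for the single correct resolution


# A runner takes (model_name, scenario) and returns the outcome label
# the model reached. In practice this would drive a full agent episode.
Runner = Callable[[str, Scenario], str]


def validate(scenario: Scenario, run: Runner) -> Dict[str, bool]:
    """Report, per validator model, whether it reached the expected outcome."""
    return {
        model: run(model, scenario) == scenario.expected_outcome
        for model in VALIDATORS
    }


def keep_solvable(scenarios: List[Scenario], run: Runner) -> List[Scenario]:
    """Assumed rule: retain a scenario only if every validator solves it."""
    return [s for s in scenarios if all(validate(s, run).values())]
```

A stricter or looser gate (for example, requiring only one passing model) would be a one-line change in `keep_solvable`; which rule the authors actually used is not stated in the article.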

Impact on AI Evaluation and Multilingual Support

By open-sourcing the datasets and evaluation framework on Hugging Face, EVA-Bench 2.0 provides the industry with a standardized tool for assessing agent performance in complex, real-world enterprise environments. The focus on reproducibility, with each scenario admitting exactly one correct resolution path, allows for more reliable comparisons between models. ServiceNow-AI also announced plans for a multilingual extension, which will adapt scenarios, user data, and evaluation metrics for non-English languages to inform global AI deployments.

By mandating a single correct resolution path and generating user goals as deterministic decision trees, ServiceNow-AI's EVA-Bench 2.0 moves enterprise agent evaluation from ambiguous success rates to a more rigorous, engineering-grade benchmark focused on reproducibility and policy compliance.
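
To make the idea of a deterministic decision tree with a single correct resolution path concrete, here is a minimal Python sketch. It is one possible interpretation, not the benchmark's actual data format: the `Node` structure, the outcome labels, and the action names (`verify_identity`, `rebook_flight`, `issue_refund`) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    # Leaf node: `outcome` is set and `branches` is empty.
    # Internal node: `branches` maps an agent action to the next node.
    outcome: Optional[str] = None
    branches: Dict[str, "Node"] = field(default_factory=dict)


def walk(root: Node, agent_actions: List[str]) -> str:
    """Follow the agent's actions through the tree; same input, same result."""
    node = root
    for action in agent_actions:
        if node.outcome is not None:
            break  # already at a leaf; ignore trailing actions
        # An action the tree does not anticipate is a deterministic failure.
        node = node.branches.get(action, Node(outcome="failed"))
    return node.outcome if node.outcome is not None else "incomplete"


def success_paths(node: Node, prefix: Tuple[str, ...] = ()) -> List[List[str]]:
    """Enumerate every action sequence that ends in a 'resolved' leaf."""
    if node.outcome is not None:
        return [list(prefix)] if node.outcome == "resolved" else []
    paths: List[List[str]] = []
    for action, child in node.branches.items():
        paths.extend(success_paths(child, prefix + (action,)))
    return paths


# Hypothetical airline scenario: the agent must authenticate before rebooking.
tree = Node(branches={
    "verify_identity": Node(branches={
        "rebook_flight": Node(outcome="resolved"),
        "issue_refund": Node(outcome="policy_violation"),
    }),
    "rebook_flight": Node(outcome="policy_violation"),  # skipped authentication
})

# The reproducibility property: exactly one path leads to success.
assert len(success_paths(tree)) == 1
```

Because every branch is fixed in advance, two runs with the same agent actions always end at the same leaf, which is what makes pass/fail comparisons between models repeatable under this reading.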
