AiPhreaks

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

By Jakub Antkiewicz

April 16, 2026

IBM Research Details VAKRA Benchmark for Enterprise Agent Reasoning

IBM Research has released details on VAKRA, a new executable benchmark designed to evaluate how well AI agents can reason and use tools in enterprise-like scenarios. The benchmark moves beyond testing isolated skills by measuring an agent's ability to handle compositional reasoning across thousands of APIs and documents. By assessing the full execution trace of multi-step workflows, VAKRA provides a more realistic measure of an agent's reliability for complex business tasks, with initial results showing that current models perform poorly.

A Closer Look at the VAKRA Environment

The VAKRA benchmark provides a sandboxed environment where agents must interact with over 8,000 locally hosted APIs and domain-specific documents. The tasks are structured to test a progressive set of agent capabilities, requiring reasoning chains of three to seven steps that combine structured API calls with unstructured information retrieval. The benchmark is divided into four distinct task categories, each with increasing complexity.

  • Capability 1 (API Chaining): Involves 2,077 instances where agents must chain between 1 and 12 tool calls from the SLOT-BIRD and SEL-BIRD collections to manipulate data and find an answer.
  • Capability 2 (Tool Selection): Features 1,597 instances that challenge agents to select the correct tool from large, domain-specific sets of up to 328 APIs, a task complicated by API limitations like OpenAI's 128-tool context limit.
  • Capability 3 (Multi-Hop Reasoning): Contains 869 instances requiring agents to synthesize information from multiple API calls to answer a single query.
  • Capability 4 (Multi-Hop, Multi-Source Reasoning): The most complex tier with 644 instances, this combines multi-hop API reasoning with document retrieval (RAG), multi-turn dialog, and adherence to natural-language tool-use policies.
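The tool-selection tier highlights a practical harness problem: when a domain set holds up to 328 APIs but a chat API accepts at most 128 tool definitions per request, the harness must shortlist candidates before the model ever sees them. The sketch below illustrates one naive way to do that with keyword overlap; the names (`Tool`, `rank_tools`) are illustrative assumptions, not part of the VAKRA benchmark itself.

```python
# Hypothetical pre-filter for a large tool set before a model call.
# VAKRA's Capability 2 sets reach 328 APIs, above the 128-tool limit
# some chat APIs impose, so a harness must narrow the candidates first.
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str


def rank_tools(query: str, tools: list[Tool], limit: int = 128) -> list[Tool]:
    """Score tools by naive keyword overlap with the query; keep the top `limit`."""
    query_terms = set(query.lower().split())

    def overlap(tool: Tool) -> int:
        doc_terms = set((tool.name + " " + tool.description).lower().split())
        return len(query_terms & doc_terms)

    # Stable sort: ties keep their original registry order.
    return sorted(tools, key=overlap, reverse=True)[:limit]


# Toy registry of 328 tools spread over 7 made-up domains.
registry = [Tool(f"api_{i}", f"endpoint for domain {i % 7}") for i in range(328)]
shortlist = rank_tools("fetch records for domain 3", registry)
```

A production harness would swap the keyword heuristic for embedding-based retrieval, but the contract is the same: the model only ever receives a schema-legal subset of the full tool catalog.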

Execution-Centric Evaluation Sets a New Standard

A key component of VAKRA is its rigorous, execution-centric evaluation framework. Instead of merely checking the final answer, the evaluator follows a waterfall pipeline that first verifies the agent's tool-call trajectory. It executes the predicted tool sequence in the benchmark's environment and programmatically checks if the agent recovered all necessary information, even allowing for alternative but valid reasoning paths via a secondary LLM-based check. Only if the tool trajectory is deemed correct does the evaluation proceed to judge the final answer's grounding and factual consistency. This process ensures agents are scored on the validity of their reasoning, a critical factor for building trust in automated enterprise systems.
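The waterfall idea can be sketched in a few lines: gate the answer judgment behind a trajectory check, so an agent never scores on a lucky final answer. This is a minimal illustration under assumed names (`AgentTrace`, `evaluate`); VAKRA's actual evaluator additionally re-executes the predicted tool sequence and consults an LLM judge for alternative valid paths.

```python
# Hedged sketch of a waterfall evaluation pipeline: verify the tool-call
# trajectory first, and only judge the final answer if the trajectory holds.
from dataclasses import dataclass


@dataclass
class AgentTrace:
    tool_calls: list[str]       # predicted tool sequence
    gathered_facts: set[str]    # information recovered during execution
    final_answer: str


@dataclass
class Verdict:
    trajectory_ok: bool
    answer_ok: bool = False
    reason: str = ""


def evaluate(trace: AgentTrace, required_facts: set[str], gold_answer: str) -> Verdict:
    # Stage 1: did the trajectory recover every required piece of information?
    missing = required_facts - trace.gathered_facts
    if missing:
        return Verdict(trajectory_ok=False, reason=f"missing facts: {sorted(missing)}")
    # Stage 2: only now judge the final answer against the gold label.
    answer_ok = trace.final_answer.strip().lower() == gold_answer.strip().lower()
    return Verdict(trajectory_ok=True, answer_ok=answer_ok,
                   reason="" if answer_ok else "answer mismatch")
```

The key property is that `answer_ok` is never even computed for a broken trajectory, which is what lets the benchmark attribute failures to reasoning rather than to answer phrasing.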

VAKRA's emphasis on executable, multi-step workflows and policy adherence signals a necessary evolution in AI agent evaluation. The benchmark moves the industry's focus from simple answer accuracy to the reliability and verifiability of the underlying reasoning process, a critical step for enterprise adoption.