Why are the scores on ITBench-AA so low compared to other agent benchmarks?

The scores are low because of the benchmark's stringent 'average precision at full recall' scoring methodology. To receive any points for a task, an AI agent must correctly identify 100% of the ground-truth root-cause entities. If it misses even one, its score for that attempt is zero. If it identifies all root causes but also includes incorrect entities (false positives), its precision, and therefore its score, is reduced. This standard mirrors the high stakes of real-world SRE, where incomplete or inaccurate diagnoses are unacceptable.

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

New Benchmark Reveals Frontier AI Models Struggle with Enterprise IT Tasks

Artificial Analysis and IBM have launched ITBench-AA, the first benchmark designed to evaluate AI agents on complex enterprise IT tasks. The initial release focuses on Site Reliability Engineering (SRE), where even the most advanced frontier models score below 50%. The report shows Claude Opus 4.7 leading with a score of 47%, closely followed by GPT-5.5 at 46%. These results indicate that automating mission-critical functions like Kubernetes incident response remains a significant challenge, establishing ITBench-AA as a difficult new proving ground for agentic AI capabilities.

Methodology and Model Performance

The benchmark presents models with 59 unique SRE incident scenarios in which the agent must use shell commands within a sandboxed environment to diagnose the root cause of a system failure. Scoring is particularly stringent: it uses average precision at full recall, meaning a model scores zero on an attempt if it fails to identify all ground-truth root causes. Under this standard, the results show patterns that a headline score alone does not capture.

Leaderboard: Claude Opus 4.7 (47%), GPT-5.5 (46%), and Qwen3.7 Max (42%) are the top performers.
Efficiency vs. Accuracy: More investigation does not necessarily lead to better results. GPT-5.5 averaged 31 turns to achieve its 46% score, while Gemini 3.1 Pro Preview averaged 83 turns and scored only 30%, often misidentifying symptoms as root causes.
Scoring: Models are penalized for submitting false positives, which discourages listing extra candidate entities and rewards precision.
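
The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual implementation: it treats root-cause entities as plain strings compared by exact match, takes precision to be the number of correct entities divided by the number submitted, and averages per-attempt scores with equal weight. The function names and the entity names in the example are hypothetical.

```python
from typing import Iterable, List, Tuple


def precision_at_full_recall(predicted: Iterable[str],
                             ground_truth: Iterable[str]) -> float:
    """Score one attempt: 0.0 unless every ground-truth entity is found.

    Assumption: entities are compared by exact string match.
    """
    pred, truth = set(predicted), set(ground_truth)
    if not truth:
        raise ValueError("ground_truth must contain at least one entity")
    # Missing even one root cause means recall < 100%, so the attempt scores zero.
    if not truth.issubset(pred):
        return 0.0
    # Full recall reached: every extra (false-positive) entity lowers precision.
    return len(truth) / len(pred)


def average_score(attempts: List[Tuple[List[str], List[str]]]) -> float:
    """Mean per-attempt score over (predicted, ground_truth) pairs."""
    scores = [precision_at_full_recall(p, t) for p, t in attempts]
    return sum(scores) / len(scores)


# Hypothetical attempts; the entity names are made up for illustration.
attempts = [
    (["checkout-svc"], ["checkout-svc"]),              # exact answer
    (["checkout-svc", "redis"], ["checkout-svc"]),     # one false positive
    (["checkout-svc"], ["checkout-svc", "node-7"]),    # missed a root cause
]
for predicted, truth in attempts:
    print(predicted, "->", precision_at_full_recall(predicted, truth))
print("average:", average_score(attempts))
```

Under these assumptions, an agent that lists every entity it inspected gains nothing: extra entries can only lower precision, and they cannot rescue an attempt that misses a true root cause.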

The Emerging Cost-Performance Frontier

The benchmark also highlights a critical factor for enterprise adoption: cost-effectiveness. Open-weight models are demonstrating competitive performance at a fraction of the cost of their proprietary counterparts. For example, Gemma 4 31B achieves a 37% score at just $0.14 per task, significantly outperforming Gemini 3.1 Pro Preview (30% at $2.23 per task). Similarly, GLM-5.1 (40% at $1.23 per task) matches the score of Gemini 3.5 Flash at a lower cost. While Claude Opus 4.7 leads in performance, it is also the most expensive at $5.38 per task, underscoring the trade-offs between capability and operational cost that enterprises must now evaluate.
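
One way to make the trade-off concrete is to compute which of the quoted models are on the cost-performance frontier, meaning no other model is both at least as accurate and at least as cheap. The sketch below uses only the score and cost-per-task figures quoted in the paragraph above; Gemini 3.5 Flash is left out because its cost is not stated. The `Result` type and `pareto_frontier` function are hypothetical names for illustration, and the frontier over the full leaderboard may differ from the one over these four models.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    score_pct: float   # ITBench-AA score, in percent
    cost_usd: float    # cost per task, in US dollars


# Figures as quoted in the article; other leaderboard models are omitted.
RESULTS = [
    Result("Claude Opus 4.7", 47, 5.38),
    Result("GLM-5.1", 40, 1.23),
    Result("Gemma 4 31B", 37, 0.14),
    Result("Gemini 3.1 Pro Preview", 30, 2.23),
]


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Return models not dominated on (higher score, lower cost), cheapest first."""
    frontier = []
    for r in results:
        dominated = any(
            o.score_pct >= r.score_pct and o.cost_usd <= r.cost_usd
            and (o.score_pct > r.score_pct or o.cost_usd < r.cost_usd)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_usd)


for r in pareto_frontier(RESULTS):
    # Score points per dollar is a rough efficiency figure, not a benchmark metric.
    print(f"{r.model}: {r.score_pct}% at ${r.cost_usd:.2f} "
          f"({r.score_pct / r.cost_usd:.1f} points per dollar)")
```

Among these four models, Gemini 3.1 Pro Preview is the only one that falls off the frontier, since Gemma 4 31B scores higher at a lower cost.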

The low scores on ITBench-AA demonstrate that, while agentic AI is promising, the reliable automation of complex, high-stakes enterprise IT tasks like SRE incident response remains an unsolved problem. For enterprises, this shifts the focus from general capabilities to specialized, cost-effective performance.
