AiPhreaks

AI evals are becoming the new compute bottleneck

By Jakub Antkiewicz

2026-04-30T10:06:41Z

The Price of Proof: AI Evaluation Emerges as a New Compute Bottleneck

The cost of evaluating advanced AI systems has crossed a critical threshold, shifting from a routine expense to a primary compute bottleneck that threatens to reshape the research landscape. While model training has long dominated budget conversations, the resources required to rigorously test complex AI agents and scientific models are now reaching tens of thousands of dollars for a single run. This financial barrier limits who can meaningfully participate in frontier AI development, as benchmarks like the Holistic Agent Leaderboard (HAL) and PaperBench show that verification can be as costly as innovation itself.

The New Economics of AI Evaluation

The financial and computational demands of modern benchmarks stand in stark contrast to earlier static evaluations. Where techniques once allowed for 100-200x cost reductions on benchmarks like MMLU by subsampling test items, today's agentic and training-in-the-loop evaluations resist such compression. The core unit of analysis is no longer a simple prediction but a complex, multi-step trajectory or a full model training process, making each data point intrinsically expensive. This has led to a new cost structure for state-of-the-art verification.
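
For contrast, here is a minimal sketch of the kind of subsampling that kept static benchmarks cheap: score a small random subset of items and report a confidence interval instead of running the full test set. The numbers used (a synthetic pool of roughly 14,000 items standing in for MMLU, a 100-item sample, a 72% base accuracy) are illustrative assumptions, not measurements.

```python
import math
import random

def estimate_accuracy(results, sample_size, seed=0):
    """Estimate full-benchmark accuracy from a random subsample of per-item scores."""
    rng = random.Random(seed)
    sample = rng.sample(results, sample_size)
    p_hat = sum(sample) / sample_size
    # Normal-approximation 95% confidence interval half-width.
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, half_width

# Synthetic stand-in for a ~14,000-item static test set (illustrative, not real MMLU data).
full_results = [1 if random.random() < 0.72 else 0 for _ in range(14_042)]

est, ci = estimate_accuracy(full_results, sample_size=100)
print(f"accuracy ~ {est:.2f} +/- {ci:.2f} from 100 items instead of 14,042 (~140x fewer model calls)")
```

For the agentic and training-in-the-loop benchmarks listed below, no comparable shortcut applies.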

  • Holistic Agent Leaderboard (HAL): Spent approximately $40,000 to conduct 21,730 agent rollouts.
  • GAIA Benchmark: A single run with a frontier model can cost $2,829 before optimizations.
  • The Well (Scientific ML): A full four-baseline sweep requires 3,840 H100-hours, roughly $9,600 in compute (a back-of-the-envelope breakdown follows this list).
  • MLE-Bench (OpenAI): A single-seed experiment costs around $5,500, combining 1,800 A10-hours with API usage.
  • PaperBench: Evaluating one agent costs about $8,000 in API fees, with a cheaper variant developed specifically because many groups cannot afford the full version.
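
The hardware figures above are straightforward to sanity-check: multiply accelerator-hours by an hourly rate and add any API spend. The snippet below is a back-of-the-envelope sketch rather than a reproduction of any benchmark's actual billing; the roughly $2.50 per H100-hour rate is implied by The Well's own numbers, while the A10 rate and API share used for MLE-Bench are illustrative assumptions chosen to land near the reported total.

```python
def eval_cost(gpu_hours, usd_per_gpu_hour, api_usd=0.0):
    """Rough evaluation budget: accelerator time plus any API spend."""
    return gpu_hours * usd_per_gpu_hour + api_usd

# The Well: 3,840 H100-hours; the ~$2.50/hour rate is implied by the $9,600 figure above.
print(f"The Well sweep:        ${eval_cost(3_840, 2.50):,.0f}")

# MLE-Bench: 1,800 A10-hours plus API usage. The $1.05/hour A10 rate and ~$3,600 of API
# spend are illustrative assumptions chosen to land near the reported ~$5,500 total.
print(f"MLE-Bench single seed: ${eval_cost(1_800, 1.05, api_usd=3_610):,.0f}")
```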

From Technical Hurdle to Market Barrier

These escalating costs are creating a significant barrier to entry, potentially pricing out academic labs, startups, and independent researchers from validating their work against top models. The trend risks centralizing leadership within a few well-funded organizations, such as OpenAI and the UK AISI, that can absorb these expenses. Moreover, these figures typically cover just a single run: achieving statistical reliability requires multiple runs, multiplying an already prohibitive cost and making rigorous, repeatable science an exclusive luxury. As benchmarks increasingly mirror real-world tasks that involve training or complex interaction, the industry faces a future where the cost to prove a model's worth could rival the cost to build it.
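
To see how quickly repetition compounds the bill, the sketch below applies the standard 1/sqrt(k) scaling of the standard error across k independent runs: halving the width of the confidence interval roughly quadruples the number of runs, and therefore the cost. The $40,000 per-run figure is the HAL estimate cited above; the run-to-run standard deviation of the score is an illustrative assumption.

```python
import math

def runs_needed(sigma, target_half_width, z=1.96):
    """Independent runs needed for a 95% CI half-width on the mean score <= target."""
    return math.ceil((z * sigma / target_half_width) ** 2)

RUN_COST = 40_000   # roughly one HAL-scale evaluation, per the figure cited above
SIGMA = 4.0         # assumed run-to-run standard deviation of the headline score (illustrative)

for target in (4.0, 2.0, 1.0):
    k = runs_needed(SIGMA, target)
    print(f"CI half-width of {target} points -> {k} runs -> ${k * RUN_COST:,}")
```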

The industry must now treat evaluation not as a final validation step, but as a first-order resource allocation problem on par with pretraining compute. Failure to develop more efficient and accessible evaluation protocols will stifle innovation and consolidate leadership around the few organizations that can afford the growing cost of verification.