
Why we no longer evaluate SWE-bench Verified

By Jakub Antkiewicz

2026-02-24T08:45:16Z

AiPhreaks.com will no longer include 'SWE-bench Verified' results in its ongoing analysis of code-generation models. The decision follows persistent, unresolved problems accessing the benchmark's verification endpoint, which appears to be gated by OpenAI's network infrastructure. The situation raises questions about the reliability and reproducibility of any benchmark whose core functionality depends on a third-party, access-controlled service.

Our automated evaluation systems have consistently been met with security challenges when attempting to interact with the verification component. The repeated cycle of 'Verification successful. Waiting for openai.com to respond' messages, coupled with prompts to 'Enable JavaScript and cookies', indicates that the endpoint is protected by aggressive bot-detection measures. This security layer effectively blocks the programmatic access required for systematic, large-scale benchmarking, which makes the 'Verified' status irreproducible for external research and reporting.
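
A minimal sketch of how an evaluation harness might flag such interstitial pages rather than record them as results, assuming the two messages quoted above are reliable markers of a challenge page. The function names and the 'blocked', 'error', and 'ok' labels are hypothetical and chosen only for illustration.

```python
# Marker strings taken from the messages quoted in the article.
# Treating them as a challenge-page signature is an assumption.
CHALLENGE_MARKERS = (
    "Verification successful. Waiting for",
    "Enable JavaScript and cookies",
)


def looks_like_challenge_page(body: str) -> bool:
    """Return True if the response body contains any challenge marker."""
    lowered = body.lower()
    return any(marker.lower() in lowered for marker in CHALLENGE_MARKERS)


def classify_response(status_code: int, body: str) -> str:
    """Label a response as 'blocked', 'error', or 'ok' (hypothetical labels)."""
    # Check the body first: a challenge page can arrive with any status code.
    if looks_like_challenge_page(body):
        return "blocked"
    if status_code >= 400:
        return "error"
    return "ok"
```

Checking the body before the status code means a challenge page served with a success status is still counted as blocked, so it cannot be mistaken for a valid benchmark response.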

The inaccessibility of a key verification layer for a major software engineering benchmark could fragment how the industry measures progress. Without a common, accessible ground truth, the community may see a rise in self-reported performance claims that are difficult to validate. This situation underscores a critical vulnerability in the AI research ecosystem: dependence on proprietary APIs for fundamental evaluation tools creates bottlenecks that can stall objective assessment.

The increasing reliance on proprietary, access-controlled APIs for fundamental AI research benchmarks creates systemic risk, threatening the reproducibility and openness essential for scientific progress. When infrastructure becomes a gatekeeper, the entire evaluation landscape can become unstable.