What is SWE-bench and why is the 'Verified' component important?

SWE-bench is a widely used benchmark that evaluates an AI model's ability to resolve real-world software engineering issues drawn from GitHub repositories. SWE-bench Verified is a subset of 500 of those tasks, released by OpenAI, that human annotators reviewed to confirm that each issue is clearly specified and that its tests fairly judge a correct fix. Because ambiguous or effectively unsolvable tasks are filtered out, scores on the Verified subset are a more reliable measure of performance than pass rates on the full, unscreened benchmark.

Why we no longer evaluate SWE-bench Verified

AiPhreaks.com will no longer include 'SWE-bench Verified' results in its ongoing analysis of code-generation models. The decision follows persistent, unresolved problems accessing the benchmark's verification endpoint, which appears to be gated by OpenAI's network infrastructure. This raises significant questions about the reliability and reproducibility of benchmarks whose core functionality depends on third-party, access-controlled services.

Our automated evaluation systems are consistently met with security challenges when they attempt to interact with the verification component. Requests loop through the message 'Verification successful. Waiting for openai.com to respond' and prompts to 'Enable JavaScript and cookies', which is characteristic of a browser-challenge page served by bot-detection measures rather than a response from the benchmark itself. This security layer blocks the programmatic access that systematic, large-scale benchmarking requires, so we cannot reproduce 'Verified' results for external research and reporting.
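
For readers running their own pipelines, a check along the following lines can flag this kind of blocked response so that it is not mistaken for benchmark output. This is a minimal sketch, not our production code: the marker strings are taken from the messages quoted above, while the function names and the 'blocked', 'error', and 'ok' labels are hypothetical and chosen only for illustration.

```python
# Marker phrases taken from the challenge messages quoted above (lowercased).
CHALLENGE_MARKERS = (
    "verification successful. waiting for",
    "enable javascript and cookies",
)


def looks_like_challenge_page(body: str) -> bool:
    """Return True if the response body contains a known challenge marker."""
    # Normalize case and collapse whitespace so line breaks or extra
    # spaces in the HTML do not hide a marker phrase.
    text = " ".join(body.lower().split())
    return any(marker in text for marker in CHALLENGE_MARKERS)


def classify_response(status_code: int, body: str) -> str:
    """Label a fetch as 'blocked', 'error', or 'ok' (hypothetical labels)."""
    # Check the body first: a challenge page may arrive with any status code.
    if looks_like_challenge_page(body):
        return "blocked"
    if status_code >= 400:
        return "error"
    return "ok"
```

Checking the body before the status code is a deliberate choice in this sketch: it assumes a challenge page can arrive with a success status, in which case a status check alone would record the run as valid.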

The inaccessibility of a key verification layer for a major software engineering benchmark could fragment how the industry measures progress. Without a common, accessible ground truth, the community may see a rise in self-reported, difficult-to-validate performance claims. This situation underscores a critical vulnerability in the AI research ecosystem: the dependence on proprietary APIs for fundamental evaluation tools, creating bottlenecks that can stall objective assessment.

We regard the growing reliance on proprietary, access-controlled services for core AI research benchmarks as a systemic risk to the reproducibility and openness that scientific progress depends on. When infrastructure becomes a gatekeeper, evaluation results become harder to verify and harder to compare.