OpenAI has announced it is retiring its SWE-bench Verified benchmark, citing significant concerns about the integrity and accuracy of the dataset. The AI research organization stated that the benchmark has become increasingly contaminated and no longer provides a reliable measure of coding progress at the frontier of artificial intelligence.
Contamination and Data Leakage Issues
The decision follows an internal analysis that revealed several critical problems with SWE-bench Verified. OpenAI identified flawed test cases and evidence of training data leakage, which together undermine the benchmark's ability to accurately assess AI coding capabilities. "Our analysis shows that the benchmark has become increasingly contaminated and mismeasures frontier coding progress," OpenAI explained in its blog post.
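One common way data leakage of this kind is detected in practice (the source does not describe OpenAI's actual method, so this is purely illustrative) is to measure n-gram overlap between benchmark problem statements and a model's training corpus: a high overlap suggests the benchmark text may have been seen during training. A minimal sketch, with all names hypothetical:

```python
# Illustrative n-gram overlap check for benchmark contamination.
# This is a generic sketch, not OpenAI's actual analysis pipeline.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_text: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the benchmark's n-grams that also appear in the corpus.

    A ratio near 1.0 indicates the benchmark text likely appears
    verbatim (or nearly so) in the training data.
    """
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0  # benchmark text too short to form any n-gram
    corpus = ngrams(corpus_text, n)
    return len(bench & corpus) / len(bench)
```

In a real contamination audit the corpus side would be indexed at scale (e.g. hashed n-gram shards) rather than held in memory, but the underlying signal is the same.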
Recommendation for SWE-bench Pro
In place of SWE-bench Verified, OpenAI recommends using SWE-bench Pro, which the company claims addresses the issues found in its predecessor. The new benchmark aims to provide more reliable and accurate measurements for evaluating AI coding performance. This shift reflects the broader challenges in maintaining the integrity of AI evaluation datasets as models become increasingly sophisticated and capable.
The announcement underscores the ongoing difficulties in creating robust benchmarks for AI development, particularly in the rapidly advancing field of coding assistants and automated software development tools. As AI systems become more powerful, the need for accurate and uncontaminated evaluation metrics becomes increasingly critical for tracking genuine progress.