An AI model programmed nonstop for nearly 19 days on a single MirrorCode task that cost $2,600 to run

Claude Opus 4.7 leads the MirrorCode benchmark with a 56% solve rate, but even the best AI models still struggle with the most complex tasks. Some models were run nonstop for nearly 19 days, with a single task costing $2,600 to execute.

Epoch AI has unveiled MirrorCode, a benchmark that tests whether large language models can reconstruct entire software programs from scratch, without access to the original source code. The benchmark, which includes a 16,000-line toolkit, probes the limits of current AI reasoning and code generation, and it shows how much compute those abilities consume.

Breakthrough Performance and High Costs

Among the models tested, Claude Opus 4.7 led with a 56% solve rate and rebuilt the toolkit in just 14 hours. Even so, no model tested could fully solve the most complex tasks. The effort behind these results is substantial: some models ran nonstop for nearly 19 days, and a single task cost $2,600 to execute.
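For a rough sense of scale, the figures above can be turned into a back-of-the-envelope calculation. This is only a sketch: it assumes the $2,600 cost and the 19-day runtime belong to the same run, as the headline suggests, and it rounds "nearly 19 days" up to a full 19, so the hourly figure is a lower bound. The variable and function names are illustrative and do not come from Epoch AI.

```python
# Figures quoted in the article. Pairing them with one run is an assumption.
RUN_DAYS = 19            # "nearly 19 days", treated as an upper bound
TASK_COST_USD = 2600     # cost of the single most expensive task
FAST_REBUILD_HOURS = 14  # Claude Opus 4.7's toolkit rebuild time


def hours_in(days: float) -> float:
    """Convert days of nonstop runtime into hours."""
    return days * 24


def cost_per_hour(total_cost: float, days: float) -> float:
    """Average spend per hour over the whole run."""
    return total_cost / hours_in(days)


run_hours = hours_in(RUN_DAYS)
hourly = cost_per_hour(TASK_COST_USD, RUN_DAYS)
ratio = run_hours / FAST_REBUILD_HOURS

print(f"{run_hours:.0f} hours, about ${hourly:.2f} per hour")
print(f"{ratio:.1f}x longer than the 14-hour rebuild")
```

Under these assumptions, the longest run spans 456 hours at roughly $5.70 per hour, and it lasts about 32.6 times as long as the 14-hour toolkit rebuild.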

Implications for the Future of AI and Coding

The MirrorCode benchmark underscores the growing sophistication of AI code generation, but it also shows how far models remain from true autonomy in software development. Runs lasting nearly 19 days and a single task costing $2,600 raise questions about scalability and efficiency. Benchmarks like this may be a crucial step toward more robust, self-sufficient AI systems, yet they also show that machines cannot yet reliably recreate complex software ecosystems on their own.

Conclusion

While Claude Opus 4.7’s performance is a notable milestone, the MirrorCode task’s high cost and time investment point to a critical bottleneck in AI development. The journey toward AI systems that can autonomously build and maintain large-scale software remains a work in progress, with significant implications for the future of both AI and software engineering.

