Why is a new benchmark like ScarfBench necessary if we already have code generation benchmarks?

Existing benchmarks often focus on generating correct code snippets or fixing isolated bugs, but they don't capture the full complexity of enterprise application modernization. ScarfBench addresses this by evaluating an AI agent's ability to perform end-to-end framework migrations, which requires not only translating code but also correctly managing build systems, runtime dependencies, and configuration to ensure the final application builds, deploys, and behaves as expected.

ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration

IBM Research Launches ScarfBench to Test AI on Real-World Code Migration

IBM Research has released ScarfBench, an open benchmark designed to realistically evaluate AI agents on the complex task of migrating enterprise Java applications between frameworks. The project addresses a gap in AI evaluation: it moves beyond simple code generation to test whether agents can handle the intricate dependencies of real-world software modernization, where success requires preserving application behavior, not just translating syntax.

Beyond Code Translation: A More Realistic Measure

ScarfBench focuses on migrations across three major Java ecosystems: Spring, Jakarta EE, and Quarkus. Unlike typical benchmarks that check code against a reference solution, it validates each migration in three stages: whether the migrated application builds, deploys correctly, and passes a suite of expert-written behavioral tests. Initial results are sobering: even leading AI agents struggle, with the best performers achieving less than 10% end-to-end behavioral success.
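The three validation stages can be pictured as a gated pipeline. The following is a minimal sketch, not ScarfBench's actual harness: the function name `evaluate`, the stage keys, and the rule that a failed stage also fails every later stage are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Stage names follow the three stages described above; the ordering and
# gating rule are assumptions, not taken from the ScarfBench source.
STAGES = ("build", "deploy", "test")


def evaluate(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run the stage checks in order; after the first failure,
    later stages are recorded as failed without being run."""
    results: Dict[str, bool] = {}
    passed_so_far = True
    for stage in STAGES:
        # `and` short-circuits, so a check is skipped once a stage has failed.
        passed_so_far = passed_so_far and checks[stage]()
        results[stage] = passed_so_far
    return results


# Hypothetical migration that builds but does not deploy.
outcome = evaluate({
    "build": lambda: True,
    "deploy": lambda: False,
    "test": lambda: True,
})
print(outcome)
```

Under this gating assumption, a migration only counts as an end-to-end behavioral success when all three entries are true, which is why a clean build alone says little about the final score.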

Applications: 34
Framework Implementations: 102
Migration Tasks: 204
Lines of Code: ~151K
Expert-Written Tests: 1,331
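
The article does not say how these counts relate to one another. One reading that is consistent with the figures, offered here as an assumption, is that each of the 34 applications exists in all three frameworks and that each implementation is migrated to each of the other two. A short sketch of that arithmetic:

```python
from itertools import permutations

FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")
NUM_APPLICATIONS = 34

# Assumption: every application is implemented once per framework.
implementations = NUM_APPLICATIONS * len(FRAMEWORKS)

# Assumption: one task per ordered (source, target) pair of distinct frameworks.
migration_pairs = list(permutations(FRAMEWORKS, 2))
migration_tasks = NUM_APPLICATIONS * len(migration_pairs)

print(implementations, migration_tasks)
```

Under these assumptions the sketch reproduces the 102 framework implementations and 204 migration tasks listed above.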

Key Findings and Industry Impact

The benchmark's findings indicate that the primary obstacles in AI-assisted modernization lie not in transforming Java code but in managing the web of dependencies across configuration files, build tooling, and runtime environments. The research also found that agents are often overconfident, reporting successful builds for applications that fail to compile. This suggests that, for the foreseeable future, reliable independent validation and architectural reasoning will remain human-led tasks. ScarfBench gives the industry a standardized way to measure progress toward more capable modernization agents.
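
The overconfidence finding is the reason to check an agent's self-report against an independent build. The sketch below shows one way to flag the mismatch. The record fields (`task`, `claimed_build`, `verified_build`) and the sample records are hypothetical placeholders, not ScarfBench data or its output format.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, bool]]


def overclaimed_builds(records: List[Record]) -> List[str]:
    """Return the ids of tasks where the agent reported a successful
    build but an independent build did not succeed."""
    return [
        str(r["task"])
        for r in records
        if r["claimed_build"] and not r["verified_build"]
    ]


# Made-up records for illustration only.
records: List[Record] = [
    {"task": "task-a", "claimed_build": True, "verified_build": True},
    {"task": "task-b", "claimed_build": True, "verified_build": False},
    {"task": "task-c", "claimed_build": False, "verified_build": False},
]
print(overclaimed_builds(records))
```

Only the independently verified result should feed into any reported success rate; the agent's own claim is useful mainly as a signal of how well calibrated it is.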

In short, the ScarfBench results suggest that success in enterprise modernization hinges less on code translation than on navigating the non-code dependencies of build systems, deployment environments, and runtime configuration. That is a reality check for claims of fully autonomous software engineering.

Source: Hugging Face