ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
By Jakub Antkiewicz
2026-07-01T11:24:57Z
IBM Research Launches ScarfBench to Test AI on Real-World Code Migration
Researchers from IBM Research have released ScarfBench, an open benchmark designed to realistically evaluate AI agents on the complex task of migrating enterprise Java applications between different frameworks. The project addresses a critical gap in AI evaluation, moving beyond simple code generation to test whether agents can handle the intricate dependencies of real-world software modernization, where success requires preserving application behavior, not just translating syntax.
Beyond Code Translation: A More Realistic Measure
ScarfBench covers migrations across three major Java ecosystems: Spring, Jakarta EE, and Quarkus. Unlike typical benchmarks that compare output against a reference solution, it judges success in three real-world stages: whether the migrated application builds, deploys correctly, and passes a suite of expert-written behavioral tests. Initial results are sobering: even leading AI agents struggle, with the best performers achieving less than 10% end-to-end behavioral success.
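To make the staged idea concrete, here is a minimal sketch of a build, deploy, and test gate. It is an illustration only, not ScarfBench's actual harness: the names `StageResult`, `evaluate`, and `end_to_end_success` are hypothetical, and the stage checks are stand-in callables rather than real build or deployment commands.

```python
from dataclasses import dataclass
from typing import Callable

# Stage order taken from the article: build, then deploy, then behavioral tests.
STAGES = ("build", "deploy", "test")


@dataclass
class StageResult:
    stage: str
    passed: bool


def evaluate(checks: dict[str, Callable[[], bool]]) -> list[StageResult]:
    """Run the stage checks in order and stop at the first failure."""
    results = []
    for stage in STAGES:
        passed = checks[stage]()
        results.append(StageResult(stage, passed))
        if not passed:
            break  # later stages are meaningless once an earlier one fails
    return results


def end_to_end_success(results: list[StageResult]) -> bool:
    """A migration counts only if every stage ran and passed."""
    return len(results) == len(STAGES) and all(r.passed for r in results)


# Hypothetical migration that builds and deploys but fails its behavioral tests.
checks = {"build": lambda: True, "deploy": lambda: True, "test": lambda: False}
results = evaluate(checks)
```

Under this scheme a migration that compiles cleanly but changes behavior still counts as a failure, which is the distinction the benchmark draws between translating syntax and preserving behavior.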
ScarfBench at a glance:
- Applications: 34
- Framework Implementations: 102
- Migration Tasks: 204
- Lines of Code: ~151K
- Expert-Written Tests: 1,331
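The counts above fit a simple pattern, sketched below. The article does not say how the figures are derived, so this rests on two assumptions: that each application is implemented in all three frameworks, and that each ordered source-to-target framework pair counts as one migration task.

```python
from itertools import permutations

FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")
APPLICATIONS = 34

# Assumption: every application has one implementation per framework.
implementations = APPLICATIONS * len(FRAMEWORKS)

# Assumption: a task is a directed migration, so Spring -> Quarkus and
# Quarkus -> Spring are counted separately (3 * 2 = 6 ordered pairs).
directed_pairs = list(permutations(FRAMEWORKS, 2))
migration_tasks = APPLICATIONS * len(directed_pairs)

print(implementations, migration_tasks)
```

Under those assumptions the arithmetic reproduces the reported 102 framework implementations and 204 migration tasks.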
Key Findings and Industry Impact
The benchmark's findings indicate that the primary obstacles in AI-assisted modernization lie not in transforming Java code but in managing the web of dependencies across configuration files, build tooling, and runtime environments. The research also found that agents are often overconfident, reporting successful builds for applications that fail to compile. This suggests that, for the foreseeable future, reliable independent validation and architectural reasoning will remain essential human-led tasks. ScarfBench gives the industry a standardized way to measure progress toward more capable modernization agents.
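The overconfidence finding argues for trusting a build's exit status rather than an agent's own report. The sketch below shows that idea under stated assumptions: `build_succeeds` and `is_overconfident` are hypothetical helper names, and the two commands are stand-ins that simply exit with a failure or success code. A real harness would substitute the project's actual build command.

```python
import subprocess
import sys


def build_succeeds(command: list[str]) -> bool:
    """Run the build command independently and trust only its exit code."""
    completed = subprocess.run(command, capture_output=True)
    return completed.returncode == 0


def is_overconfident(agent_claims_success: bool, command: list[str]) -> bool:
    """True when the agent reports success but the independent build fails."""
    return agent_claims_success and not build_succeeds(command)


# Stand-in commands (assumptions for illustration, not real build invocations).
failing_build = [sys.executable, "-c", "import sys; sys.exit(1)"]
passing_build = [sys.executable, "-c", "pass"]

print(is_overconfident(True, failing_build))  # agent claimed success on a broken build
```

The same comparison extends to the deploy and test stages: record what the agent claims, rerun the check outside the agent's control, and flag any mismatch.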
The takeaway for enterprise modernization is that an AI agent's success hinges less on translating code than on navigating the non-code dependencies of build systems, deployment environments, and runtime configuration. That is a reality check for claims of fully autonomous software engineering.