ScarfBench: a benchmark for enterprise Java migrations
AI-assisted modernization sounds like a magic trick: an agent scans your repo and leaves it production-ready. But can it really migrate complex enterprise applications without breaking anything? ScarfBench shows up to answer that with data, not promises.
What is ScarfBench
ScarfBench (Self-Contained Application Refactoring Benchmark) is an open benchmark designed to evaluate code agents on real migration tasks between enterprise Java ecosystems: Spring, Jakarta EE and Quarkus.
It doesn’t stop at comparing source files against a reference. Instead, it requires migrated applications to compile, deploy, and preserve functional behavior. Why does that matter? Because a useful migration isn’t just pretty code: it’s code that runs in a real environment and does what it’s supposed to.
How ScarfBench evaluates
ScarfBench includes two types of tasks: focused migrations (individual components or layers) and full-application migrations. It starts from a taxonomy based on Java Specification Requests (JSRs) and uses expert-verified migrations to generate implementations for each target framework.
The evaluation follows three practical stages:
Build: the application must compile successfully.
Deploy: it must be deployable in the expected environment (container, server, etc.).
Behavioral validation: functional tests must confirm that behavior was preserved.
This Build -> Deploy -> Test flow makes clear that a successful compile is not enough to guarantee a correct migration.
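The staged flow can be sketched as a small gate that stops at the first failing stage. This is only an illustration of the idea, not ScarfBench's actual harness: the stage names and the placeholder checks are assumptions.

```python
from typing import Callable, Sequence

# Hypothetical stage names mirroring the three stages described above.
STAGES = ("build", "deploy", "behavior")


def evaluate(checks: Sequence[Callable[[], bool]]) -> dict:
    """Run the stage checks in order and stop at the first failure."""
    results = {name: False for name in STAGES}
    for name, check in zip(STAGES, checks):
        if not check():
            break  # later stages cannot pass once an earlier one fails
        results[name] = True
    return results


# Placeholder checks: a real harness would invoke the build tool, start the
# container, and run the functional test suite here.
outcome = evaluate([lambda: True, lambda: True, lambda: False])
print(outcome)  # {'build': True, 'deploy': True, 'behavior': False}
```

Because each stage gates the next, an application that builds and deploys can still count as a failed migration.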
Key results (what they found when putting agents to the test)
Agents that shine on traditional benchmarks do make progress here, but their ScarfBench numbers are far more modest. Highlights:
Behavioral success rates are low: even the most advanced agents stay under 10% behavioral success on full-application migrations.
Build success rates exceed deploy success rates, which in turn exceed behavioral validation rates. Conclusion: build success alone overestimates migration quality.
Difficulty varies by framework pair; Jakarta EE proved particularly challenging.
A practical datapoint: one agent (Claude Code) reported successful builds for 29 of 30 full applications, but only 22 of those actually compiled when independently verified. Conversely, the one application the agent marked as failed did compile correctly. What does that tell you? Agents’ self-reports aren’t reliable; independent validation remains crucial.
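The arithmetic behind that datapoint is worth spelling out. The sketch below uses only the figures quoted above and assumes that "22 of those" refers to the 29 builds the agent reported as successful; the variable names are illustrative.

```python
TOTAL = 30
reported_ok = 29           # builds the agent reported as successful
verified_of_reported = 22  # of those, builds that compiled when re-checked
reported_failed = TOTAL - reported_ok  # 1 build the agent marked as failed
verified_of_failed = 1     # that build did compile when re-checked

false_positives = reported_ok - verified_of_reported   # claimed OK, did not compile
false_negatives = verified_of_failed                    # claimed failed, did compile
verified_ok = verified_of_reported + verified_of_failed
agreement = verified_of_reported + (reported_failed - verified_of_failed)

print(f"independently verified builds: {verified_ok}/{TOTAL}")
print(f"false positives: {false_positives}, false negatives: {false_negatives}")
print(f"self-report matched verification on {agreement}/{TOTAL} apps")
```

Under that reading, the self-report was wrong on 8 of 30 applications, in both directions.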
Why migrating frameworks is much more than changing annotations
If you imagine a migration as a find-and-replace of annotations, you’re underestimating the problem. A full migration usually requires changing:
Dependency injection and scopes.
Persistence configuration and queries.
Descriptors and config files (XML files, application.properties, application.yaml and similar).
The build system and its wrappers (Maven, Gradle).
Small mistakes in any of these pieces can block deployment. Also, changes ripple across layers: configuration, web, database and services are the layers agents visit most.
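Configuration files are a good place to see why find-and-replace falls short. The sketch below translates a few Spring Boot property keys to their Quarkus counterparts and flags everything it does not recognize for manual review. The mapping is a deliberately tiny, illustrative subset, and the function name is hypothetical; none of it comes from ScarfBench.

```python
# Illustrative subset only; a real migration involves many more keys.
SPRING_TO_QUARKUS = {
    "spring.datasource.url": "quarkus.datasource.jdbc.url",
    "spring.datasource.username": "quarkus.datasource.username",
    "spring.datasource.password": "quarkus.datasource.password",
    "server.port": "quarkus.http.port",
}


def translate_properties(text: str):
    """Return (translated lines, keys that need manual review)."""
    translated, needs_review = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            translated.append(line)  # keep blanks and comments untouched
            continue
        key, value = (part.strip() for part in stripped.split("=", 1))
        if key in SPRING_TO_QUARKUS:
            translated.append(f"{SPRING_TO_QUARKUS[key]}={value}")
        else:
            needs_review.append(key)  # no safe one-to-one rename known
    return translated, needs_review


spring_props = """\
server.port=8081
spring.datasource.url=jdbc:postgresql://localhost:5432/app
spring.jpa.hibernate.ddl-auto=update
"""
translated, needs_review = translate_properties(spring_props)
print(translated)
print("review manually:", needs_review)
```

Even in this toy case one key has no entry in the table, and the target framework may expect settings with no source-side equivalent (Quarkus, for example, typically wants `quarkus.datasource.db-kind`), so a missed entry can block deployment even when the code compiles.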
Technical and engineering observations
ScarfBench doesn’t just measure whether something fails, but how and why it fails. Some observed patterns:
Agents repeatedly revisit configuration layers: that signals an iterative dependency-resolution process rather than a simple code transformation.
Failures don’t always originate in source code. Environmental issues like Docker caches, occupied ports, or inconsistencies in the Maven wrapper often block validation.
Difficulties span a wide spectrum: build systems, deployment infrastructure, dependency injection, databases, endpoints and test assertions.
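Some of these environmental failures are cheap to rule out before validation starts. As one example, a pre-flight check for an occupied port might look like this minimal sketch; the function name and the port number 8080 are assumptions chosen for illustration.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))  # fails if another process holds the port
        except OSError:
            return False
        return True


# Hypothetical port for the application under test.
if not is_port_free(8080):
    print("Port 8080 is occupied; a deploy failure here would not be the migration's fault.")
```

Running a check like this before each deploy helps separate infrastructure noise from genuine migration errors.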
Technically, this suggests an ideal agent needs more than good code generation: it requires architectural reasoning, environment state management, and diagnostic and repair capabilities.
What ScarfBench brings to the technical community
ScarfBench is a measurement and diagnostic tool:
A dataset of migrations and source code across multiple frameworks.
A public leaderboard to compare agents and approaches.
Documentation and open source code so researchers and engineering teams can reproduce and extend the studies.
For researchers: a platform to compare architectures, prompting techniques, fine-tuning, and strategies for interacting with environments. For production teams: a testbed to validate modernization solutions before risking real systems.
What this means for your modernization project
If you’re thinking of using AI agents to migrate an enterprise application, take note:
Don’t trust the agent’s output alone. Implement independent build verification and functional tests.
Prepare your infrastructure: clean containers, port control, and consistent build wrappers reduce operational noise.
Consider a hybrid strategy: automate repetitive tasks with agents and leave architectural reasoning and critical verification to your human team.
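Independent build verification can be as simple as re-running the build yourself and trusting only the exit code. The following is a minimal sketch under stated assumptions: `verify_build` is a hypothetical helper, and the stand-in command replaces what would in practice be your project's own build command (for a Maven project, something like `./mvnw -q package` in a clean checkout).

```python
import subprocess
import sys
from typing import Sequence


def verify_build(command: Sequence[str], cwd: str = ".", timeout: int = 1800) -> bool:
    """Return True only if the build command exits with status 0."""
    try:
        completed = subprocess.run(
            command, cwd=cwd, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # missing tool or hung build counts as a failure
    return completed.returncode == 0


# Stand-in command that always fails, so the sketch runs without Maven installed.
agent_claims_success = True
actually_built = verify_build([sys.executable, "-c", "raise SystemExit(1)"])
if agent_claims_success and not actually_built:
    print("Agent reported success, but the independent build failed.")
```

The point of the design is that the verdict comes from a process the agent does not control, not from the agent's own summary of its work.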
ScarfBench shows the big frontier isn’t translating lines, but managing the web of dependencies between code, configuration, and infrastructure.
If you want a technical deep dive, examine the most common failures and how often agents revisit each layer: both give clues about where to invest to make automation pay off (better tests, standardized config, hardened CI/CD, etc.).
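As a starting point for that kind of analysis, the sketch below counts how often each layer is touched in an agent's edit trace. The trace, the path rules, and the layer names are all hypothetical; adapt them to your own repository layout and logging format.

```python
from collections import Counter

# Hypothetical path markers; the first matching rule wins.
LAYER_RULES = [
    ("src/main/resources/", "configuration"),
    ("pom.xml", "build"),
    ("/web/", "web"),
    ("/repository/", "database"),
    ("/service/", "services"),
]


def layer_of(path: str) -> str:
    """Map a file path to a coarse architectural layer."""
    for marker, layer in LAYER_RULES:
        if marker in path:
            return layer
    return "other"


# Made-up trace of files an agent edited, in order.
trace = [
    "pom.xml",
    "src/main/resources/application.properties",
    "src/main/java/app/service/OrderService.java",
    "src/main/resources/application.properties",
    "pom.xml",
    "src/main/resources/application.properties",
]

visits = Counter(layer_of(path) for path in trace)
for layer, count in visits.most_common():
    print(f"{layer}: {count} visits")
```

Layers that dominate such a count are candidates for standardization or better test coverage before handing the next migration to an agent.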