I've been reading Anthropic's technical write-up on how they designed a harness so models like Claude can build complete applications across long sessions. The trick? A multi-agent architecture inspired by GANs, plus a lot of prompt and evaluation engineering to keep quality and coherence over hours-long runs.
What they did (technical summary)
Anthropic designed a three-agent system: planner, generator and evaluator. The central idea is to separate responsibilities: the planner turns a short prompt into an ambitious spec; the generator implements features (typical stack: React, Vite, FastAPI, SQLite/Postgres); and the evaluator tests the running app using Playwright MCP to give actionable feedback.
They started from two prior lessons: break work into manageable chunks and use structured artifacts to pass context between sessions. To push quality ceilings they also used a GAN-like strategy: a generator creates and an evaluator judges. The evaluator was calibrated with few-shot examples and concrete criteria to turn subjective judgments into gradable metrics.
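The GAN-like loop described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's code: `generate` and `evaluate` stand in for model calls, and the per-criterion threshold is an assumption.

```python
# Hypothetical sketch of a generator/evaluator refinement loop.
from dataclasses import dataclass


@dataclass
class Critique:
    scores: dict   # per-criterion scores, e.g. {"originality": 6, ...}
    feedback: str  # actionable critique fed back to the generator

    def passed(self, threshold: float) -> bool:
        # Disaggregated scoring: every criterion must clear the bar,
        # not just the average.
        return all(s >= threshold for s in self.scores.values())


def refine(generate, evaluate, spec, threshold=7.0, max_iters=15):
    """Iterate until the evaluator approves or the iteration budget runs out."""
    artifact = generate(spec, feedback=None)
    critique = None
    for _ in range(max_iters):
        critique = evaluate(artifact)
        if critique.passed(threshold):
            break
        artifact = generate(spec, feedback=critique.feedback)
    return artifact, critique
```

The separation matters: the generator never grades itself, which is exactly the leniency problem the next section describes.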
Key problems they tackled:
"Context anxiety": models that try to close out the task when they think they're near their context window limit. Solution: context resets or compaction depending on the model.
Overly lenient self-evaluation: agents tend to approve their own work. Separating out the evaluator and tuning its skepticism proved more effective.
Frontend experiment: generator-evaluator loop
For interface design they used an iterative loop: the generator produces HTML/CSS/JS, the evaluator navigates it with Playwright, takes screenshots, scores it by criteria (e.g. originality, fidelity to design principles) and returns detailed critique. They ran 5 to 15 iterations per generation; some runs lasted up to 4 hours.
Practical details:
They favored criteria that penalized "AI slop" patterns and encouraged aesthetic risk.
The evaluator was calibrated with examples and disaggregated scoring to reduce drift.
The wording of criteria directly influenced the character of outputs (phrases like "museum quality" nudged a certain style).
Notable result: the generator made creative jumps (for example, moving from a classic landing page to a 3D CSS experience through iteration), something hard to achieve in a single pass.
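The evaluator's browser pass can be approximated with Playwright's Python sync API. This is a hedged sketch: the criterion names and the 7-point bar are illustrative, not Anthropic's rubric, and the screenshot would be judged by a separate model call not shown here.

```python
# Sketch of the evaluator's browser pass plus disaggregated scoring.
CRITERIA = ["originality", "visual_hierarchy", "fidelity_to_brief"]  # assumed names


def capture(url: str, shot_path: str = "page.png") -> str:
    """Render the generated page and save a full-page screenshot for the judge."""
    # Imported here so the scoring helper below works without Playwright installed.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=shot_path, full_page=True)
        browser.close()
    return shot_path


def aggregate(scores: dict) -> tuple:
    """Report the mean plus every failing criterion, so critique stays actionable."""
    failing = [c for c in CRITERIA if scores.get(c, 0) < 7]
    mean = sum(scores.get(c, 0) for c in CRITERIA) / len(CRITERIA)
    return mean, failing
```

Returning the list of failing criteria, not just an aggregate number, is what keeps the feedback loop from drifting toward vague approval.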
Applying it to full-stack and the three-agent architecture
When they scaled to full applications, they mapped the same pattern:
Planner: takes a 1–4 sentence prompt and generates an ambitious spec, prioritizing product scope and high-level design over implementation details.
Generator: implements in sprints (or rounds in the simplified version), performs self-evaluation and uses git for version control.
Evaluator: runs functional integration tests with Playwright, tests UI, endpoints and DB state, and applies criteria with hard thresholds; if one fails, the sprint doesn't pass.
Communication and handoff: agents exchange files (one writes, another reads), which keeps structured context across sessions without overloading the context window.
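The file-based handoff can be as simple as a structured JSON artifact: one agent writes it at the end of a session, the next reads it in a fresh context. The field names below are our assumption, not Anthropic's schema.

```python
# Minimal sketch of a file-based handoff artifact between agents.
import json
from pathlib import Path


def write_handoff(path: Path, spec: str, done: list, next_steps: list) -> None:
    """Persist enough state for the next agent to resume without the old context."""
    path.write_text(json.dumps({
        "spec": spec,              # the planner's ambitious spec
        "completed": done,         # features already implemented and committed
        "next_steps": next_steps,  # prioritized work for the next sprint
    }, indent=2))


def read_handoff(path: Path) -> dict:
    """Load the artifact at the start of a fresh session."""
    return json.loads(path.read_text())
```

Because the artifact lives on disk (and the code lives in git), a reset costs no information that was deliberately written down, which is the whole point of the pattern.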
Cost vs quality comparison (RetroForge example):
Solo (single-agent): 20 min, $9, but significant functional failures.
Full harness: 6 hours, $200, much more polished and functional result.
The evaluator was key to catch concrete bugs (examples given: issues in reordering frames, delete handlers with wrong conditions, fillRectangle not triggered). But tuning the evaluator required iteration to counter its initial tendency to downplay problems.
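The hard-threshold gating mentioned above is easy to make concrete. A sketch, with invented check names: scores come from the evaluator's Playwright run, and one criterion under its bar is enough to block the sprint.

```python
# Illustrative hard-threshold gate for a sprint review.
def gate(scores: dict, thresholds: dict) -> tuple:
    """Pass only if every criterion clears its bar; return the failures as feedback."""
    failing = [name for name, bar in thresholds.items()
               if scores.get(name, 0.0) < bar]
    return (not failing, failing)
```

For example, `gate({"ui_flows": 8, "db_state": 5}, {"ui_flows": 7, "db_state": 7})` fails the sprint and hands `["db_state"]` back to the generator as the thing to fix.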
Evolution with models: resets, compaction and Opus 4.x
They found two strategies to handle long context:
Compaction: summarize history to maintain continuity. Good if the agent doesn't suffer "context anxiety."
Reset + handoff: close the context and start a new agent, passing an artifact with enough state. Solves context anxiety but adds latency and complexity.
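The choice between the two strategies can be framed as a simple policy on context usage. The 70% trigger and the anxiety flag below are illustrative assumptions, not numbers from Anthropic.

```python
# Hedged sketch of the compaction-vs-reset decision for long sessions.
def next_step(tokens_used: int, window: int,
              prone_to_context_anxiety: bool,
              compact_at: float = 0.7) -> str:
    """Decide how to continue as the context window fills up."""
    if tokens_used / window < compact_at:
        return "continue"            # plenty of room left
    if prone_to_context_anxiety:
        return "reset_with_handoff"  # fresh agent + state artifact
    return "compact"                 # summarize history in place
```

In this framing, Opus 4.x effectively flips `prone_to_context_anxiety` to false, which is why compaction alone became viable.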
With Sonnet 4.5 they needed resets. With Opus 4.6, the model improved planning and context handling, allowing continuous sessions using automatic compaction and reducing the need for some scaffolds (e.g., forced sprints). Still, the evaluator remains valuable when the task is near the edge of what the model can do on its own.
Practical results: a DAW and cost lessons
On a prompt to build a DAW in the browser (Web Audio API), harness V2 ran ~3 h 50 min and spent about $125 in tokens. The generator built the main views, transport, a mixer and an agent that generates song fragments, but QA still found interactive gaps (a recording stub, clip dragging, effect visualizers).
Cost-benefit lessons:
The harness delivers a clear jump in quality, but it's substantially more expensive and slower.
As models improve, reconsider which harness components remain necessary; some can be removed and others should be adapted.
Practical recommendations for developers who want to replicate this
Start simple: try single-agent and add components until you see tangible improvement.
Separate agents by responsibility: planner for scope, generator for implementation, evaluator for verification.
Use structured artifacts (files, contracts) for handoffs between sessions or agents.
Invest in a real evaluator that interacts with the app (Playwright or another tool) and calibrate its skepticism with few-shot examples and hard thresholds.
Monitor traces and logs: read the evaluator's records to adjust prompts and tests.
Reevaluate the harness when you migrate to a new model: some pieces may no longer be needed and others will let you expand capabilities.
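Calibrating the evaluator's skepticism with few-shot examples, as the recommendations suggest, mostly comes down to prompt construction. A sketch, with invented example texts and scores:

```python
# One way to anchor an evaluator with graded few-shot examples.
FEW_SHOT = [
    ("Generic hero section, default fonts, no interaction.", 3),
    ("Cohesive palette, custom layout, one broken link.", 6),
    ("Distinctive 3D CSS treatment, all flows working.", 9),
]


def build_evaluator_prompt(criteria: list) -> str:
    """Graded examples keep scores comparable across runs and models."""
    lines = [
        "You are a strict design evaluator. Score 1-10 per criterion.",
        "Criteria: " + ", ".join(criteria),
        "Calibration examples:",
    ]
    lines += [f'- "{desc}" -> {score}/10' for desc, score in FEW_SHOT]
    lines.append("Do not award above 7 unless the work would survive expert review.")
    return "\n".join(lines)
```

The last line is the skepticism dial: wording it more or less harshly is the cheap lever for countering the evaluator's tendency to downplay problems.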
Final reflection
This work shows that harness engineering is both art and science. A powerful model alone isn't enough: the structure around the model, the type of evaluation and how context is exchanged determine whether a long session produces usable software. As models improve, the harness's role changes, not disappears. Want your agent to do complex work? Design the right feedback, and if you can, have another agent be the judge.