I've been reading Anthropic's technical write-up on how they designed a harness so models like Claude can build complete applications across long sessions. The trick? A multi-agent architecture inspired by GANs, plus a lot of prompt and evaluation engineering to keep quality and coherence over hours-long runs.
What they did (technical summary)
Anthropic designed a three-agent system: planner, generator and evaluator. The central idea is to separate responsibilities: the planner turns a short prompt into an ambitious spec; the generator implements features (typical stack: React, Vite, FastAPI, SQLite/Postgres); and the evaluator tests the running app using Playwright MCP to give actionable feedback.
They started from two prior lessons: break work into manageable chunks and use structured artifacts to pass context between sessions. To push quality ceilings they also used a GAN-like strategy: a generator creates and an evaluator judges. The evaluator was calibrated with few-shot examples and concrete criteria to turn subjective judgments into gradable metrics.
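The GAN-like loop described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's code: `generate` and `evaluate` stand in for model calls, and the per-criterion threshold is an assumption.

```python
# Hypothetical sketch of a generator/evaluator refinement loop.
from dataclasses import dataclass


@dataclass
class Critique:
    scores: dict   # per-criterion scores, e.g. {"originality": 6, ...}
    feedback: str  # actionable critique fed back to the generator

    def passed(self, threshold: float) -> bool:
        # Disaggregated scoring: every criterion must clear the bar,
        # not just the average.
        return all(s >= threshold for s in self.scores.values())


def refine(generate, evaluate, spec, threshold=7.0, max_iters=15):
    """Iterate until the evaluator approves or the iteration budget runs out."""
    artifact = generate(spec, feedback=None)
    critique = None
    for _ in range(max_iters):
        critique = evaluate(artifact)
        if critique.passed(threshold):
            break
        artifact = generate(spec, feedback=critique.feedback)
    return artifact, critique
```

The separation matters: the generator never grades itself, which is exactly the leniency problem the next section describes.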
Key problems they tackled:
"Context anxiety": models that try to close out the task when they think they're near their context window limit. Solution: context resets or compaction depending on the model.
Overly lenient self-evaluation: agents tend to approve their own work. Separating out the evaluator and tuning its skepticism proved more effective.
Frontend experiment: generator-evaluator loop
For interface design they used an iterative loop: the generator produces HTML/CSS/JS, the evaluator navigates it with Playwright, takes screenshots, scores it by criteria (e.g. originality, fidelity to design principles) and returns detailed critique. They ran 5 to 15 iterations per generation; some runs lasted up to 4 hours.
Practical details:
They favored criteria that penalized "AI slop" patterns and encouraged aesthetic risk.
The evaluator was calibrated with examples and disaggregated scoring to reduce drift.
The wording of criteria directly influenced the character of outputs (phrases like "museum quality" nudged a certain style).
Notable result: the generator made creative jumps (for example, moving from a classic landing page to a 3D CSS experience through iteration), something hard to achieve in a single pass.
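The evaluator's browser pass can be approximated with Playwright's Python sync API. This is a hedged sketch: the criterion names and the 7-point bar are illustrative, not Anthropic's rubric, and the screenshot would be judged by a separate model call not shown here.

```python
# Sketch of the evaluator's browser pass plus disaggregated scoring.
CRITERIA = ["originality", "visual_hierarchy", "fidelity_to_brief"]  # assumed names


def capture(url: str, shot_path: str = "page.png") -> str:
    """Render the generated page and save a full-page screenshot for the judge."""
    # Imported here so the scoring helper below works without Playwright installed.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=shot_path, full_page=True)
        browser.close()
    return shot_path


def aggregate(scores: dict) -> tuple:
    """Report the mean plus every failing criterion, so critique stays actionable."""
    failing = [c for c in CRITERIA if scores.get(c, 0) < 7]
    mean = sum(scores.get(c, 0) for c in CRITERIA) / len(CRITERIA)
    return mean, failing
```

Returning the list of failing criteria, not just an aggregate number, is what keeps the feedback loop from drifting toward vague approval.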
Applying it to full-stack and the three-agent architecture
When they scaled to full applications, they mapped the same pattern:
Planner: takes a 1–4 sentence prompt and generates an ambitious spec, prioritizing product scope and high-level design over implementation details.
Generator: implements in sprints (or rounds in the simplified version), performs self-evaluation and uses git for version control.
Evaluator: runs functional integration tests with Playwright, tests UI, endpoints and DB state, and applies criteria with hard thresholds; if one fails, the sprint doesn't pass.
Communication and handoff: agents exchange files (one writes, another reads), which keeps structured context across sessions without overloading the context window.
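The file-based handoff can be as simple as a structured JSON artifact: one agent writes it at the end of a session, the next reads it in a fresh context. The field names below are our assumption, not Anthropic's schema.

```python
# Minimal sketch of a file-based handoff artifact between agents.
import json
from pathlib import Path


def write_handoff(path: Path, spec: str, done: list, next_steps: list) -> None:
    """Persist enough state for the next agent to resume without the old context."""
    path.write_text(json.dumps({
        "spec": spec,              # the planner's ambitious spec
        "completed": done,         # features already implemented and committed
        "next_steps": next_steps,  # prioritized work for the next sprint
    }, indent=2))


def read_handoff(path: Path) -> dict:
    """Load the artifact at the start of a fresh session."""
    return json.loads(path.read_text())
```

Because the artifact lives on disk (and the code lives in git), a reset costs no information that was deliberately written down, which is the whole point of the pattern.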
Cost vs quality comparison (RetroForge example):
Solo (single-agent): 20 min, $9, but significant functional failures.
Full harness: 6 hours, $200, much more polished and functional result.
The evaluator was key to catch concrete bugs (examples given: issues in reordering frames, delete handlers with wrong conditions, fillRectangle not triggered). But tuning the evaluator required iteration to counter its initial tendency to downplay problems.
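The hard-threshold gating mentioned above is easy to make concrete. A sketch, with invented check names: scores come from the evaluator's Playwright run, and one criterion under its bar is enough to block the sprint.

```python
# Illustrative hard-threshold gate for a sprint review.
def gate(scores: dict, thresholds: dict) -> tuple:
    """Pass only if every criterion clears its bar; return the failures as feedback."""
    failing = [name for name, bar in thresholds.items()
               if scores.get(name, 0.0) < bar]
    return (not failing, failing)
```

For example, `gate({"ui_flows": 8, "db_state": 5}, {"ui_flows": 7, "db_state": 7})` fails the sprint and hands `["db_state"]` back to the generator as the thing to fix.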
Evolution with models: resets, compaction and Opus 4.x
They found two strategies to handle long context:
Compaction: summarize history to maintain continuity. Good if the agent doesn't suffer "context anxiety."
Reset + handoff: close the context and start a new agent, passing an artifact with enough state. Solves context anxiety but adds latency and complexity.
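The choice between the two strategies can be framed as a simple policy on context usage. The 70% trigger and the anxiety flag below are illustrative assumptions, not numbers from Anthropic.

```python
# Hedged sketch of the compaction-vs-reset decision for long sessions.
def next_step(tokens_used: int, window: int,
              prone_to_context_anxiety: bool,
              compact_at: float = 0.7) -> str:
    """Decide how to continue as the context window fills up."""
    if tokens_used / window < compact_at:
        return "continue"            # plenty of room left
    if prone_to_context_anxiety:
        return "reset_with_handoff"  # fresh agent + state artifact
    return "compact"                 # summarize history in place
```

In this framing, Opus 4.x effectively flips `prone_to_context_anxiety` to false, which is why compaction alone became viable.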
With Sonnet 4.5 they needed resets. With Opus 4.6, the model improved planning and context handling, allowing continuous sessions using automatic compaction and reducing the need for some scaffolds (e.g., forced sprints). Still, the evaluator remains valuable when the task is near the edge of what the model can do on its own.
Practical results: a DAW and cost lessons
On a prompt to build a DAW in the browser (Web Audio API), harness V2 ran ~3 h 50 min and spent about $125 in tokens. The generator built the main views, transport, a mixer and an agent that generates song fragments, but QA still found interactive gaps (a recording stub, clip dragging, effect visualizers).
Cost-benefit lessons:
The harness delivers a clear jump in quality, but it's substantially more expensive and slower.
As models improve, reconsider which harness components remain necessary; some can be removed and others should be adapted.
Practical recommendations for developers who want to replicate this
Start simple: try single-agent and add components until you see tangible improvement.
Separate agents by responsibility: planner for scope, generator for implementation, evaluator for verification.
Use structured artifacts (files, contracts) for handoffs between sessions or agents.
Invest in a real evaluator that interacts with the app (Playwright or another tool) and calibrate its skepticism with few-shot examples and hard thresholds.
Monitor traces and logs: read the evaluator's records to adjust prompts and tests.
Reevaluate the harness when you migrate to a new model: some pieces may no longer be needed and others will let you expand capabilities.
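Calibrating the evaluator's skepticism with few-shot examples, as the recommendations suggest, mostly comes down to prompt construction. A sketch, with invented example texts and scores:

```python
# One way to anchor an evaluator with graded few-shot examples.
FEW_SHOT = [
    ("Generic hero section, default fonts, no interaction.", 3),
    ("Cohesive palette, custom layout, one broken link.", 6),
    ("Distinctive 3D CSS treatment, all flows working.", 9),
]


def build_evaluator_prompt(criteria: list) -> str:
    """Graded examples keep scores comparable across runs and models."""
    lines = [
        "You are a strict design evaluator. Score 1-10 per criterion.",
        "Criteria: " + ", ".join(criteria),
        "Calibration examples:",
    ]
    lines += [f'- "{desc}" -> {score}/10' for desc, score in FEW_SHOT]
    lines.append("Do not award above 7 unless the work would survive expert review.")
    return "\n".join(lines)
```

The last line is the skepticism dial: wording it more or less harshly is the cheap lever for countering the evaluator's tendency to downplay problems.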
Final reflection
This work shows that harness engineering is both art and science. A powerful model alone isn't enough: the structure around the model, the type of evaluation and how context is exchanged determine whether a long session produces usable software. As models improve, the harness's role changes, not disappears. Want your agent to do complex work? Design the right feedback, and if you can, have another agent be the judge.