AI is no longer just about quick answers. You're increasingly asking agents to work for hours or even days on complex projects, but limited context windows break continuity. How do you get an agent to make coherent progress when every session starts with no memory of the previous one?
The problem of long-running agents
Imagine a software project where every new shift arrives with no memory of what came before. Sounds chaotic, right? That's exactly what happens when an agent has to work across multiple context windows: each session is a discrete unit, and the next one remembers nothing of the last.
Advanced models (for example Opus 4.5 in the Claude Agent SDK) have tools like compaction to save tokens and carry relevant information forward. But compaction isn't enough: the agent might try to do everything in a single session and leave work half-done, or declare the project finished before it actually is. The result? Incomplete code, missing documentation, and sessions that burn time just reconstructing what happened before.
Anthropic proposes a harness with two key pieces: an initializer agent that prepares the environment in the first session, and a coding agent that, in each subsequent session, makes incremental progress and leaves clear artifacts for the next session.
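To make the shape of that harness concrete, here is a minimal shell sketch, where run_session is a hypothetical wrapper that launches one Agent SDK session with a role and a prompt (it is not an SDK command):
# One initializer session, then coding sessions until every feature passes.
run_session initializer "Set up the repo: init.sh, feature_list.json, claude-progress.txt, first commit."
while jq -e 'any(.[]; .passes == false)' feature_list.json > /dev/null; do
  run_session coder "Read claude-progress.txt, feature_list.json and the git log. Pick ONE failing feature, implement it, verify it end-to-end, commit, and update claude-progress.txt."
done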
Initializer agent
Creates the initial repo with an init.sh to start the server.
Generates a claude-progress.txt file that records what each session did.
Produces a feature_list.json with an extensive list of end-to-end requirements (in their demo, more than 200 features for a claude.ai clone).
Makes the first git commit to leave an initial history.
The idea is to keep enough context outside the model (in files and commits) that any new session can quickly understand how far the work has progressed.
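As a rough illustration (the commands and file contents are placeholders, not Anthropic's actual setup), the artifacts the initializer leaves behind could look like this:
git init
cat > init.sh << 'EOF'
#!/usr/bin/env bash
# Start the app and run a basic health check (commands are illustrative).
npm install
npm run dev &
sleep 5
curl -sf http://localhost:3000 > /dev/null && echo "server is up"
EOF
chmod +x init.sh
echo "Session 1: scaffolded the project, wrote init.sh and feature_list.json" > claude-progress.txt
# feature_list.json itself is written by the model: 200+ end-to-end requirements.
git add -A
git commit -m "Initial scaffold: init.sh, feature_list.json, claude-progress.txt"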
Coding agent
At the start of each session it runs basic commands like pwd, reads claude-progress.txt, feature_list.json and the git log.
Picks a single high-priority feature that still fails.
Implements that feature, tests it end-to-end (ideally with browser automation), makes a commit with a descriptive message and writes an update in claude-progress.txt.
This approach avoids the "one-shot" attempt and forces small, verifiable increments that make recovery from errors much easier.
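In shell, that session-start ritual might look like this sketch, using jq to pull the first failing feature (the exact commands the agent runs will vary):
pwd
cat claude-progress.txt
git log --oneline -20
./init.sh
# Select the first feature whose passes flag is still false.
jq -c '[.[] | select(.passes == false)][0]' feature_list.json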
Environment management and key files
Anthropic found that certain files reduce ambiguity between sessions:
init.sh: script to start the server and run basic checks.
feature_list.json: list of features as JSON objects that include steps and a passes (boolean) field.
claude-progress.txt: a log of changes and decisions, readable by humans and by the next session.
Git history: lets you revert to stable states.
Example entry in feature_list.json (simplified):
{ "category": "functional", "description": "New chat button creates a fresh conversation", "steps": ["Navigate to main interface", "Click the 'New Chat' button", "Verify a new conversation is created"], "passes": false }
Anthropic recommends not allowing agents to delete or rewrite the tests; they should only change passes to true after rigorous verification.
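One way to enforce that rule mechanically (a sketch, assuming feature_list.json is a JSON array of entries like the one above; Anthropic doesn't describe this particular check) is to verify that, apart from the passes flags, the committed list and the working copy are identical:
# Any difference beyond the passes field means the tests were rewritten.
git show HEAD:feature_list.json | jq 'map(del(.passes))' > /tmp/features_before.json
jq 'map(del(.passes))' feature_list.json > /tmp/features_after.json
diff -q /tmp/features_before.json /tmp/features_after.json \
  && echo "OK: only passes flags changed" \
  || echo "WARNING: feature definitions were modified"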
Testing and verification
A common failure mode was marking features as complete without end-to-end verification. The fix was to give the agent testing tools (for example the Puppeteer MCP server) and explicitly instruct it to test like a user: open the app, create a chat, send a message, and verify the response.
Limitations: the model's vision and automation tools aren't perfect. For example, Puppeteer might not expose native browser modals, and those cases remain fragile.
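Purely as an illustration (the real verification goes through the browser, and the /api/chats endpoint below is hypothetical), the create-a-chat, send-a-message, check-the-response loop can be approximated with a plain HTTP smoke test:
# Hypothetical endpoints; the point is verifying behavior end-to-end, not just that code compiles.
chat_id=$(curl -sf -X POST http://localhost:3000/api/chats | jq -r '.id')
curl -sf -X POST "http://localhost:3000/api/chats/$chat_id/messages" \
  -H 'Content-Type: application/json' \
  -d '{"content": "Hello"}' | jq -e '.reply | length > 0' > /dev/null \
  && echo "chat round-trip OK"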
Typical session flow
[Assistant] I'll catch up with the project state.
[Tool Use] bash - pwd
[Tool Use] read - claude-progress.txt
[Tool Use] read - feature_list.json
[Tool Use] bash - git log --oneline -20
[Tool Use] bash - ./init.sh (verify the server is running)
[Assistant] Run basic functionality tests.
[Tool Use] Start implementing the selected feature.
[Assistant] Commit and write the update in claude-progress.txt.
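Sketched in shell (the feature name, session number and messages reuse the earlier example and are illustrative), that closing step amounts to:
# Flip the verified feature's passes flag, then leave artifacts for the next session.
jq '(.[] | select(.description == "New chat button creates a fresh conversation") | .passes) = true' \
  feature_list.json > feature_list.tmp && mv feature_list.tmp feature_list.json
echo "Session 7: implemented and verified the New Chat button end-to-end" >> claude-progress.txt
git add -A
git commit -m "Add New Chat button; verified end-to-end, passes=true"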
This pattern cuts wasted tokens and prioritizes restoring a stable state before adding new code.
Failure modes and countermeasures
Problem: the agent declares the project finished too soon.
Initializer: generates feature_list.json with end-to-end requirements.
Coding agent: reads the list and works on a single feature.
Problem: messy or buggy code at the end of a session.
Initializer: leaves git history and progress notes.
Coding agent: runs tests, makes descriptive commits and updates claude-progress.txt.
Problem: marking features as done without tests.
Initializer: structures the feature list.
Coding agent: self-verifies and only sets passes=true after passing tests.
Problem: the agent doesn't know how to run the app.
Initializer: writes init.sh.
Coding agent: runs it at the start.
Future and open questions
It remains to be seen whether a single generalist agent beats a specialized multi-agent architecture (a testing agent, a QA agent, a cleanup agent). It also remains to be validated whether these practices generalize beyond full-stack web apps, for example to scientific research or financial modeling.
The practical lesson is clear: to make AI autonomy work over long horizons, you have to move part of the memory to disk and to version control. Sounds obvious? Maybe. Does it work? In Anthropic's experiments, yes: it improves coherence, reduces wasted time and makes recovering from errors easier.