A few years ago, the idea of letting an AI agent work on its own for days sounded like science fiction. Today you can specify a clear goal, give it context and a set of rules, and watch a team of agents perform complex numerical work while you only check in occasionally. Sounds like magic? It's project management, supercharged by models capable of long-range tasks.
What is a "long-running" workflow for science
Anthropic describes how to move from a short conversational loop to a workflow where an agent operates autonomously for days: initial planning, persistent memory, test oracles, and orchestration patterns. This lets you compress months of human work into days for well-bounded tasks: rewriting legacy code, reimplementing a numerical solver, or debugging a large codebase against a reference.
In the technical example, they use Claude Opus 4.6 with Claude Code to implement a differentiable version of a cosmological Boltzmann solver. That solver evolves coupled equations for photons, baryons, neutrinos, and dark matter, and its output is compared to data like Planck's. Making it differentiable in JAX opens the door to gradient-based inference, speeding parameter estimation dramatically.
Why this approach makes sense
- The tasks that fit this mode have quantifiable success criteria and well-defined boundaries.
- Numerical errors tend to propagate downstream, so the agent must trace a bug back to its cause, not just execute isolated steps.
- JAX is a natural choice: automatic differentiation and compatibility with accelerators like GPUs give you an edge without extra plumbing.
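To make the differentiability payoff concrete, here is a minimal JAX sketch. The chi2 function and its parameters are toy stand-ins, not part of the real solver; the point is only that jax.grad gives you exact gradients of a model-vs-data misfit, which is what enables gradient-based parameter inference.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for a solver-based likelihood: a power-law model against
# synthetic "data" generated with known true parameters (amplitude 2.0,
# slope 1.5). In the real project the model would be the Boltzmann solver.
x = jnp.linspace(0.1, 1.0, 10)
data = 2.0 * x ** 1.5

def chi2(params):
    """Sum of squared residuals between the toy model and the data."""
    model = params[0] * x ** params[1]
    return jnp.sum((model - data) ** 2)

# Automatic differentiation: an exact gradient, no finite differences.
grad_chi2 = jax.grad(chi2)
g = grad_chi2(jnp.array([2.0, 1.5]))  # zero gradient at the true parameters
```

From here, any gradient-based sampler or optimizer can use grad_chi2 directly; on a GPU the same code runs without modification.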
Practical components of the workflow
Below I explain the patterns that truly matter if you want to try this in an academic lab.
1) CLAUDE.md: the living specification
Before you unleash agents, design a CLAUDE.md at the repo root. There you encode objectives, success criteria (for example 0.1% agreement with the CLASS reference implementation), constraints, and commit rules. Claude keeps that file in context and can update it as it works.
Important: leave explicit rules about when to commit, what to test before pushing, and what not to touch.
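As a minimal sketch, a CLAUDE.md along these lines would cover the essentials; the specific targets, commands, and paths here are illustrative, not taken from the actual project:

```markdown
# CLAUDE.md

## Goal
Reimplement the Boltzmann solver in JAX, differentiable end to end.

## Success criterion
- Agree with the CLASS reference to within 0.1% on the main outputs.

## Rules
- Run `pytest tests/ -x -q` before every commit; never push failing tests.
- Commit after each meaningful unit of work; record it in CHANGELOG.md.
- Do not modify the reference outputs under tests/reference/.
```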
2) CHANGELOG.md: memory for the long run
Use a progress file (for example CHANGELOG.md) as the agent's portable memory. Record current state, completed tasks, failed approaches and why they failed. Without this, repeated sessions will stumble over the same dead ends.
Example changelog entry:
- Tried Tsit5 for the perturbation ODEs; the system is too stiff. Switched to Kvaerno5.
3) Test oracle: the project's compass
The agent needs a quantifiable way to know if it's making progress. For scientific code that usually means:
- A reference implementation (for example CLASS in C)
- An explicit numerical target (0.1% at critical points)
- A test suite that grows with the project
Instruct the agent to expand and run the suite constantly to avoid regressions. In this project, Claude built and ran unit tests using CLASS as the reference.
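A test against the oracle can be as simple as a worst-case relative error check. This is a self-contained sketch with synthetic arrays standing in for the CLASS output and the JAX solver's output; the names and the spectra themselves are illustrative:

```python
import numpy as np

def max_rel_err(candidate, reference):
    """Worst-case relative error between two spectra."""
    return float(np.max(np.abs(candidate - reference) / np.abs(reference)))

# Synthetic stand-ins: a smooth "reference" spectrum and a candidate
# that deviates from it by a uniform 0.05%.
ell = np.arange(2, 2001)
cl_reference = 1.0 / ell**2                 # pretend CLASS output
cl_candidate = cl_reference * (1.0 + 5e-4)  # pretend JAX solver output

assert max_rel_err(cl_candidate, cl_reference) < 1e-3  # the 0.1% target
```

In a real suite the reference would be loaded from precomputed CLASS files at several parameter points, not just one fiducial model, precisely to avoid the coverage gap described later in this article.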
4) Git as light coordination
Have the agent commit and push after each meaningful unit of work. This gives you a recoverable history and visibility. A practical rule: run pytest tests/ -x -q before every commit and never push code that breaks existing tests.
5) Orchestration on HPC with SLURM and tmux
On an HPC cluster you can run Claude Code inside a tmux session on a node reserved by SLURM. An example job script:

```bash
#!/bin/bash
#SBATCH --job-name=claude-agent
#SBATCH --partition=GPU-shared
#SBATCH --gres=gpu:h100-32:1
#SBATCH --time=48:00:00
#SBATCH --output=agent_%j.log

cd $PROJECT/my-solver
source .venv/bin/activate
export TERM=xterm-256color

# Start Claude Code in a detached tmux session
tmux new-session -d -s claude "claude; exec bash"

# Block on a tmux channel so the SLURM job stays alive while the
# session runs (nothing signals the channel, so this waits until
# the job's time limit)
tmux wait-for claude
```
Once the job starts, connect with:

```bash
srun --jobid=JOBID --overlap --pty tmux attach -t claude
```
You can attach to give targeted direction like "Read CHANGELOG.md and take the next task" and then detach.
Useful orchestration patterns: the Ralph loop and variants
Current models can show "agentic laziness": they declare victory after a subtask without truly completing the overall goal. That's where the Ralph loop helps: it is essentially a loop that asks the agent whether it has really finished, and re-invokes it if the answer doesn't meet the criterion.
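The pattern itself is easy to sketch in Python. Here run_agent is a stub standing in for an actual Claude Code invocation, and completion is detected by looking for the promise string in the agent's output; a real harness would shell out to the CLI and also verify the test oracle:

```python
# Sketch of the Ralph-loop pattern: re-invoke the agent until its output
# contains the completion promise, or until we hit the iteration cap.

def ralph_loop(run_agent, prompt, promise="DONE", max_iterations=20):
    for iteration in range(1, max_iterations + 1):
        output = run_agent(prompt)
        if promise in output:          # agent claims completion
            return iteration, output
    return max_iterations, output      # cap reached; surface the last output

# Stub agent that only "finishes" on its third invocation.
calls = {"n": 0}
def fake_agent(prompt):
    calls["n"] += 1
    return "still working" if calls["n"] < 3 else "all tests pass. DONE"

iterations, final = ralph_loop(fake_agent, "reach 0.1% accuracy")
# iterations == 3; final contains the promise "DONE"
```

The cap matters: without it, an agent that never converges would burn compute indefinitely, so failure is surfaced to the human instead.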
Installation and example usage inside Claude Code (plugin notation):

```
/ralph-loop:ralph-loop 'Please keep working on the task until the success criterion of 0.1% accuracy across the entire parameter range is achieved.' --max-iterations 20 --completion-promise 'DONE'
```
Claude will repeat up to 20 iterations and will only emit the completion promise 'DONE' once its internal checks confirm the objective.
Other similar patterns include GSD and domain-adapted variants, and the native /loop command in Claude Code.
Results and observed limitations
In the described experiment, Claude worked for several days and reached sub-percent agreement with the CLASS reference on main outputs like the CMB angular spectra. Progress was traced with git milestones and accuracy metrics.
But not everything was perfect. The agent showed common shortcomings:
- Incomplete test coverage: for a while it only tested a fiducial point, reducing its ability to catch errors.
- Elementary mistakes: confusion in gauge conventions, or hours lost on bugs a cosmologist would spot immediately.
- Not yet production-grade across all numerical regimes.
Still, the experience showed that an agent with a good test oracle, memory, and clear rules can accelerate research work from months to days.
Practical and ethical implications
Having agents work nights and weekends changes the lab's economics: every night you don't run agents could be lost progress. That forces new priorities: environment security, compute costs, and human validation at critical points.
You also need transparency in the change log and manual review of key scientific steps. An agent can speed engineering, but human verification and physical interpretation remain essential.
In the end, the experiment works like a recipe: define clear objectives, provide a reference oracle, discipline the agent with memory and commits, and use re-check loops like the Ralph loop. If you have compute and projects with quantifiable criteria, you can start delegating nights of work. You'll learn not only from the code the agent produces, but from the record of decisions it leaves behind: an AI-generated lab notebook that helps you absorb the knowledge.
