Nicholas Carlini and his team at Anthropic pushed Claude to do something that sounds like science fiction: coordinate many instances in parallel to write from scratch a compiler in Rust capable of compiling the Linux kernel. Sounds crazy? It is. Does it work? Largely yes, and what matters isn’t just the artifact but the lessons about how to design an environment for LLM agents to work autonomously for days.
What they did and why it matters
The idea was simple to state and brutal to execute: run 16 Claude agents in parallel, each in its own container, collaborating on a shared Git repository to build a C compiler in Rust that compiles Linux 6.9 on x86, ARM and RISC-V. The project consumed almost 2,000 Claude Code sessions and 2 billion input tokens, generated 140 million output tokens, and cost roughly $20,000 USD.
The result: a ~100,000-line compiler that compiles Linux and other big projects (QEMU, FFmpeg, SQLite, Redis, Postgres) and passes about 99% of several test suites, including GCC’s torture tests. It also compiles and runs Doom — which is always a nice emotional milestone for engineers.
But don’t imagine a perfect replacement: there are clear limitations (no real-mode 16‑bit support for x86, reliance on external assembler and linker for some phases, and generated code that’s less efficient than GCC). What’s interesting is what they learned about engineering the “harness” that lets LLM agents work autonomously and in parallel.
How they enabled long-running Claudes
The problem with traditional LLM agents is they stop: they wait for human input, questions, or confirmations. To remove that dependency they built an infinite loop that restarts Claude every time a task finishes, putting the instance into continuous work mode. Essentially, when one session ends, another boots and continues with the next step.
Don't overcomplicate it: think of a tiny supervisord that restarts processes. In their case the loop looked something like `while true; do claude -p "AGENT_PROMPT.md" ...; done`. Also, each agent runs inside an isolated container to avoid environment contamination.
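A minimal sketch of what such a restart loop could look like, assuming the standing instructions live in a mounted AGENT_PROMPT.md and are fed to Claude Code's non-interactive mode; the path, backoff, and error handling are illustrative, not the project's exact script:

```bash
#!/usr/bin/env bash
# Minimal sketch of the "keep the agent running" loop (names and details are
# illustrative). When a Claude Code session exits, the next one starts
# immediately and picks up wherever the shared repo leaves off.
AGENT_PROMPT=/workspace/AGENT_PROMPT.md   # standing instructions in the container

while true; do
  # -p runs Claude Code non-interactively with the given prompt text.
  claude -p "$(cat "$AGENT_PROMPT")" || true
  sleep 5   # small backoff so a crashing session doesn't spin at full speed
done
```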
A funny and real detail: at one point Claude ran `pkill -9 bash` by accident and killed itself. Agents aren't gods; they're tools with curious failures.
Running Claude in parallel: simple, effective synchronization
The collaboration architecture was deliberately minimal and based on Git:
- An empty "upstream" repo mounted into each container.
- Each agent clones it to `/workspace`, works, then pushes to upstream.
- To avoid two agents doing exactly the same task there's a lock system: an agent writes a file in `current_tasks/` (for example `parse_if_statement.txt`) to claim the task.
- If two agents try to claim the same task, Git synchronization forces one to back off and pick another task.
- Typical flow: take the lock, work, `git pull`, resolve merges, `git push`, delete the lock.
Merge conflicts were frequent, but Claude was able to resolve many of them autonomously. There’s no master orchestrator: each Claude decides its next step, which simplifies the design but leaves some high-level choices uncontrolled.
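A minimal sketch of that lock protocol, assuming the `current_tasks/` layout described above; the branch name, commit messages, and back-off handling are illustrative rather than the project's actual script:

```bash
#!/usr/bin/env bash
# Sketch of the Git-based task lock described above (details assumed). An agent
# claims a task by committing a file under current_tasks/; if the push is
# rejected, another agent got there first and this one backs off.
set -euo pipefail

TASK="$1"                          # e.g. parse_if_statement
LOCK="current_tasks/${TASK}.txt"

claim_task() {
  git pull --rebase origin main
  if [ -e "$LOCK" ]; then
    echo "task already claimed: $TASK" >&2
    return 1
  fi
  echo "claimed by ${HOSTNAME} at $(date -u +%FT%TZ)" > "$LOCK"
  git add "$LOCK"
  git commit -m "claim task: ${TASK}"
  # If another agent pushed the same lock first, our push fails and we back off.
  git push origin main || { git reset --hard origin/main; return 1; }
}

release_task() {
  git rm -f "$LOCK"
  git commit -m "release task: ${TASK}"
  git pull --rebase origin main
  git push origin main
}

claim_task || exit 1
# ... do the actual work, commit changes ...
release_task
```

The key property is that Git itself arbitrates races: whichever push lands first wins the task, and the loser simply resets and picks something else.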
Test design: the soul of the system
If an agent is going to work autonomously, you need impeccable tests. Claude tends to optimize for what you tell it to test; if the verifier is bad, the agent optimizes in the wrong direction.
Key lessons:
- Write robust verifiers and build scripts whose output is easy to parse. For example, when there's an error the output should contain the word `ERROR` on the same line as the cause, so `grep` and the agents can find it without filtering megabytes of logs.
- Avoid "context window pollution." Don't print thousands of bytes to STDOUT; save logs and show clear summaries. Precompute aggregated stats so the model doesn't have to redo work.
- Mitigate "time blindness": agents don't know how long a test takes. They implemented sampling options (`--fast`) that run a deterministic 1% or 10% of tests per agent, so each instance covers different portions without repeating the whole corpus.
- Strict continuous integration: when the compiler started breaking things with each new feature, they added pipelines that blocked commits causing regressions.
In short: tests aren’t auxiliary, they’re the main interface between humans and agents.
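To make those ideas concrete, here's a sketch of an agent-friendly test runner along those lines: single-line grep-able `ERROR` output, full logs kept out of STDOUT, and a deterministic `--fast` slice per agent. The flag name matches the post, but `run_one_test.sh`, the `AGENT_ID` variable, and the hashing scheme are assumptions:

```bash
#!/usr/bin/env bash
# Sketch of an agent-friendly test runner (assumed details, not the project's
# actual script): failures print as single grep-able ERROR lines, full logs go
# to files, and --fast runs a deterministic per-agent slice of the suite.
set -u

PERCENT=100
[ "${1:-}" = "--fast" ] && PERCENT=10          # --fast: run ~10% of the tests

AGENT_ID="${AGENT_ID:-0}"                      # assumed: injected per container
STRIDE=$(( 100 / PERCENT ))                    # 10% -> every 10th hash bucket
mkdir -p logs
pass=0; fail=0

for test in tests/*.c; do
  # Deterministic sampling: hash the test name so each agent covers a stable,
  # distinct slice of the corpus instead of re-running everything.
  bucket=$(( $(cksum <<<"$test" | cut -d' ' -f1) % STRIDE ))
  [ "$bucket" -eq $(( AGENT_ID % STRIDE )) ] || continue

  log="logs/$(basename "$test").log"
  if ./run_one_test.sh "$test" >"$log" 2>&1; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
    # One line, cause included, so `grep ERROR` is all an agent needs.
    echo "ERROR: $test: $(tail -n 1 "$log")"
  fi
done

echo "SUMMARY: $pass passed, $fail failed (agent $AGENT_ID, 1/$STRIDE of tests)"
```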
Effective parallelism and known oracles
Parallelizing independent tests is easy: each agent hunts a different bug. The headache came with the kernel, which is one big job. Many agents hit the same bug, stepped on each other, and didn’t make progress.
The elegant solution was to use GCC as a "known-good oracle": compile most kernel files with GCC and only a subset with the Claude compiler. If the resulting kernel fails, you shrink the subset handled by Claude's compiler until you localize the problematic files. That technique, combined with delta debugging to find interactions between file pairs, let them parallelize work even within an apparently monolithic task.
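A sketch of how that oracle-guided narrowing could work; `build_kernel_mixed.sh`, `boot_test.sh`, and `all_kernel_files.txt` are hypothetical helpers standing in for the real build and boot-test steps:

```bash
#!/usr/bin/env bash
# Sketch of the GCC-as-oracle bisection described above (build/test commands
# are placeholders). Everything outside suspects.txt is compiled with GCC;
# only the suspect files go through the Claude-built compiler. Halve the
# suspect list until the failure is localized.
set -euo pipefail

build_and_boot() {
  # Hypothetical helper: builds the kernel using the Claude compiler only for
  # the files listed in $1 (GCC for the rest), then boot-tests the image.
  ./build_kernel_mixed.sh "$1" && ./boot_test.sh
}

cp all_kernel_files.txt suspects.txt

while [ "$(wc -l < suspects.txt)" -gt 1 ]; do
  half=$(( ($(wc -l < suspects.txt) + 1) / 2 ))
  head -n "$half" suspects.txt > first_half.txt
  tail -n "+$((half + 1))" suspects.txt > second_half.txt

  if ! build_and_boot first_half.txt; then
    mv first_half.txt suspects.txt          # bug reproduces in the first half
  elif ! build_and_boot second_half.txt; then
    mv second_half.txt suspects.txt         # bug reproduces in the second half
  else
    # Neither half fails alone: the failure needs an interaction between files,
    # which is where pairwise delta debugging comes in.
    echo "failure requires a combination of files; switching to delta debugging"
    break
  fi
done

echo "suspect files:"
cat suspects.txt
```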
Agent roles and specialization
One advantage of running many agents is that you can assign them distinct roles. Practical examples they used:
- Agents dedicated to coalescing duplicated code.
- Agents that optimize compiler performance.
- Agents focused on generating more efficient target code.
- Agents that review design quality from a Rust developer’s perspective.
- Agents that maintain and update documentation and READMEs so new containers orient quickly.
Specialization reduces redundancy and speeds convergence, though it requires designing tasks/locks so roles don’t constantly conflict.
Evaluation: the numbers that matter
- Models: the Claude Opus 4 series (4.5 and 4.6; 4.6 was the version that crossed the threshold).
- Sessions: ~2,000 Claude Code sessions over two weeks.
- Tokens: 2B in, 140M out.
- Cost: ~20,000 USD in API spend.
- Result: ~100k-line compiler that compiles Linux 6.9 on x86, ARM, RISC-V; compiles QEMU, FFmpeg, SQLite, Postgres, Redis; ~99% on many test suites; compiles and runs Doom.
That cost may seem high, but the author notes it’s a fraction of what it would cost to assemble a human team to do the same work in the same time.
Technical limits and risks
Despite the success there’s tech debt and clear limits:
- 16‑bit support for x86: Claude couldn’t produce compact codegen within the 32 KB limit Linux requires for real mode, so the boot flow still depends on GCC for that phase.
- Assembler and linker: the project still uses external tools in some phases because Claude’s latest efforts on those parts were unstable.
- Rust code quality: good enough to work, but not at expert human level.
- Efficiency of generated code: even with optimizations the output is less efficient than GCC without optimizations. Plenty of room for improvement.
- Fragility: new features frequently broke existing functionality; strict tests and CI are central.
Beyond the technical there are trust risks: would you deploy critical software you never manually verified? The author, with a background in penetration testing, warns that passing tests aren’t equivalent to exhaustive human verification.
What this leaves us with and where it goes from here
This experiment shows that the boundaries of what’s possible with LLMs move fast: from autocompleting functions in an IDE to coordinating teams of agents that deliver complex artifacts. But it also reminds us that “functional” isn’t the same as “production-ready.”
If you work on developer tools or on integrating LLMs into pipelines, what can you take away? Design tests as if they were contracts, log with models in mind, split work to enable real parallelism, and use known oracles when a task is inherently monolithic.
And ethics and security? We need processes to validate, audit, and gate deployments that were generated mostly autonomously. The experiment is exciting, but it ties technical urgency to responsibility.
Original source
https://www.anthropic.com/engineering/building-c-compiler
Summary: Anthropic coordinated 16 Claude instances in parallel to create, in Rust, a C compiler that can compile Linux 6.9; the project used almost 2,000 sessions, 2 billion tokens and cost ~20,000 USD. The real value is the technical lessons: harness design, robust tests, Git lock synchronization, and using GCC as an oracle to parallelize monolithic tasks.
