Claude Code shows persistent returns to expertise

Jun 16, 2026 · Keryc Díaz · 5 minutes

Anthropic published a technical analysis of ~400,000 Claude Code sessions between October 2025 and April 2026. What do people do when they use an agent that can run code on its own? Who makes the decisions, and what matters for a session to succeed? This text summarizes the main findings, the methodology, and the implications for both technical and non-technical work.

Key findings

The study uses a framework to describe interactive and agentic coding use, based on privacy-preserving analysis of ~400,000 sessions. It evaluates task types, human-AI collaboration, and success rates.
On average, people make most of the planning decisions (what to do, what counts as done) while Claude makes most of the execution decisions (which files to change, what code to write). The split is clear: users make ~70% of planning decisions but only ~20% of execution decisions.
A user’s domain experience matters more than coding skill: across users, the probability of success rises with subject-matter experience. The biggest jump is from novice to intermediate; the difference between intermediate and expert is smaller.
Between October 2025 and April 2026, the share of usage spent on debugging dropped by nearly half, and usage shifted toward end-to-end flows: deploying and running code, analyzing data, and producing non-code documents. Estimated economic value per task rose by roughly 25–27% on average.

Methodology and metrics (technical but clear)

Data source: Claude Code sessions from the CLI, claude.ai and the desktop app. Embedded use in SDKs/IDEs and headless calls such as claude -p "<prompt>" are excluded. The analysis is privacy-preserving and works on aggregated data.
Session classification: each session is assigned to one of nine work modes (create, fix, test, orchestrate, operate software, understand, plan, analyze data, write documents). A model reads the transcript, and its labels are validated against telemetry (for example, real changes to files). Agreement exceeds 90% in several categories.
Decision attribution: they built a classifier that detects meaningful decisions and labels them as planning or execution, attributing each decision to the user or to Claude. Typical result: the user controls planning, Claude controls execution.
Measures of autonomous activity: a typical session has ~4 turns. Each user prompt triggers a chain of actions by Claude, ~10 on average, with a long tail: 2% of prompts trigger more than 100 actions. Claude produces around 2,400 words of output per turn when acting autonomously.
Expertise classifier: a 5-level scale (novice to expert) that detects signals like precision in instructions, what the user asks to verify and who corrects whom. Expertise is task-specific (an accounting expert can be a Rust novice).
Verified success vs judgment: they use a judgment classifier (did they achieve what they wanted?) and a classifier of verifiable signals (commits, passing tests, explicit confirmation). The combination gives a measure of “verified success.”
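One way to read the "verified success" measure is as a conjunction of the two classifiers. The sketch below is only an illustration: it assumes a session counts as verified when the judgment classifier says the goal was met and at least one verifiable signal is present. The summary does not state the exact combination rule, and the names SessionSignals and outcome are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SessionSignals:
    """Hypothetical per-session outputs of the two classifiers."""
    judged_success: bool          # judgment classifier: did the user get what they wanted?
    has_commit: bool = False      # verifiable signal: a commit was made
    tests_passed: bool = False    # verifiable signal: tests passed
    user_confirmed: bool = False  # verifiable signal: explicit confirmation


def outcome(s: SessionSignals) -> str:
    """Assumed rule: judged success plus at least one verifiable signal."""
    verifiable = s.has_commit or s.tests_passed or s.user_confirmed
    if s.judged_success and verifiable:
        return "verified"
    if s.judged_success:
        return "judged_only"
    return "not_successful"
```

Under this assumed rule, a session that looks successful in the transcript but leaves no commit, passing test, or confirmation stays in the weaker "judged_only" bucket, which is why verified rates sit well below partial-success rates.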

Division of labor in practice

Structure: when the user retains control of execution (over 80%), Claude performs fewer actions per turn (~8). If Claude assumes planning, the chain becomes longer (~16 actions).
Differences by experience: in novice sessions each prompt triggers ~5 actions and ~600 words; in expert sessions each prompt triggers ~12 actions and ~3,200 words. In other words, experience amplifies how much autonomous work the agent does per instruction.
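The shares behind this division of labor can be pictured as a simple tally over labeled decisions. The sketch below is a minimal illustration with made-up data: the decision list, the user_share helper, and the reading of "over 80%" as the user's share of execution decisions are all assumptions, not the report's actual pipeline.

```python
from collections import Counter

# Hypothetical labeled decisions for one session: (kind, decided_by).
decisions = [
    ("planning", "user"),
    ("planning", "user"),
    ("planning", "claude"),
    ("execution", "claude"),
    ("execution", "claude"),
    ("execution", "user"),
    ("execution", "claude"),
]


def user_share(decisions, kind):
    """Fraction of decisions of the given kind attributed to the user."""
    counts = Counter(who for k, who in decisions if k == kind)
    total = sum(counts.values())
    return counts["user"] / total if total else None


planning_share = user_share(decisions, "planning")    # 2 of 3 decisions
execution_share = user_share(decisions, "execution")  # 1 of 4 decisions

# Assumed threshold: the user "retains control of execution" above 80%.
user_controls_execution = execution_share is not None and execution_share > 0.8
```

In this toy session the user makes most planning decisions and few execution decisions, which mirrors the typical pattern described above.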

Who uses Claude Code and what they do

Occupations: they classified occupations into 23 Standard Occupational Classification (SOC) groups using project signals (files, artifacts referenced, vocabulary). They don’t label someone as “software” just because they write code.
Composition: ~56% of sessions are writing, fixing, or testing/orchestrating code; 17% operate software; 14% plan/explore; 13% analyze or write documents.
Changes over time: the fraction spent fixing code fell from 33% to 19%. Software operation rose from 14% to 21%. Writing and analysis increased notably.
Estimated value: comparing tasks to freelancer listings, median estimated value per session grew ~27% between October and April. This is a relative approximation for trend comparison, not an absolute price.
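When reading the shifts above, it helps to separate a change in percentage points from a change relative to the starting share. The short sketch below works through the two quoted shifts; the function names are illustrative, and the inputs are the approximate figures from this summary.

```python
def point_change(before: float, after: float) -> float:
    """Change in percentage points between two shares given in percent."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change relative to the starting share, as a fraction of it."""
    return (after - before) / before


# Shares (in percent) quoted above for October and April.
shifts = {"fixing": (33, 19), "operating": (14, 21)}

for name, (before, after) in shifts.items():
    points = point_change(before, after)
    relative = round(relative_change(before, after) * 100, 1)
    print(f"{name}: {points:+} points, {relative:+}% relative")
```

Fixing falls by 14 points, which is about 42% of its starting share and matches the "nearly half" figure in the key findings. Software operation gains 7 points, a 50% relative increase.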

Returns to expertise and success

Success patterns: sessions classified as novice reach verified success in ~15% of cases and at least partial success in 77%. Intermediate-to-expert sessions reach verified success in ~28–33% of cases and at least partial success in ~91–92%.
Recovery from problems: when a failure arises, experts turn those episodes into success more often than novices. Among sessions with trouble, verified success rises from 4% (novice) to 15% (expert).
Abandonment: novices abandon sessions (fail without writing any code) in 19% of cases, versus 5–7% for more experienced users.
Occupation vs expertise: workers outside software have verified success rates close to those of software engineers (a difference of ~5 points or less in code-producing sessions). By and large, expertise demonstrated in the session matters more than job title.

What this means for work and adoption

Agents amplify the ability of domain-knowledgeable people to produce technical work without being professional programmers. Does this democratize software production? Partly: domain experience helps you better direct the agent and get higher-value results.
It doesn’t replace expertise: Claude can implement, but understanding the problem remains the main success factor. If over time models begin to provide that essential judgment, we might see returns to expertise decline; that’s a signal to monitor.
Labor-supply changes: if trends continue, more technical tasks could become part of day-to-day work across professions; the valued skills may shift toward diagnosis and problem specification rather than low-level implementation.

Limitations and open questions

No real-world outcomes observed: the analysis is based on transcripts, verifiable signals (commits, tests) and value estimators. It doesn’t measure whether the code is used in production or what its real economic impact is.
Exclusions: headless sessions and embedded use in IDEs/SDKs are out. It’s possible that a significant share of non-interactive use isn’t reflected here.
Classifier validation: models read transcripts and classifiers show good agreement with telemetry, but large-scale validation is challenging. Anchoring conclusions in multiple sources and repetitions is key.
Dynamic signals: the relationship between expertise and success can change as models improve. Monitoring temporal evolution is crucial to understand labor-market impact.

Conclusion

The report paints a clear picture: code agents are taking on execution while people keep defining the what. Domain expertise strongly amplifies what an agent can achieve from an instruction, and that expertise matters more than job title. In practice, when you work with agents like Claude Code, learning the domain you care about can give you much more leverage than only learning to code.

Original source

https://www.anthropic.com/research/claude-code-expertise

