Anthropic has published an empirical analysis of how people use AI agents in the real world: how much they let them act without supervision, which domains they’re used in, and how risky their actions are. The short version: practical autonomy still lags behind what models can do, but it’s growing fast in the long tail of more ambitious uses.
What they measured and why it matters
Why read this if you don’t work in AI? Because agents are already in your email, in developer tools, and starting to creep into finance and health. Understanding how much human control remains and where risks emerge is key to designing safer products, useful rules, and smart policies.
Anthropic analyzed millions of interactions using two complementary sources:
- Claude Code: full sessions can be reconstructed, which makes it possible to measure how long the agent works without human intervention (turn duration) and how patterns change with user experience.
- Public API traffic: broad coverage of real deployments across thousands of customers, analyzed at the level of individual tool calls. It gives breadth but can’t reassemble entire sessions.
Both sources were processed with privacy-preserving infrastructure to extract metrics on autonomy, risk, and human oversight.
Methodology and metrics (technical)
Operational definition: an agent is a system that uses tools (run code, call external APIs, send messages) to act on the world. Measuring it involves two approaches with tradeoffs:
- Public API: analyzes individual tool calls. Advantage: scale and diversity. Limit: can’t reconstruct long flows or sessions.
- Claude Code: full sessions. Advantage: you can measure turns, pauses, and intervention rates. Limit: biased toward software engineering.
Key metrics used (a minimal computation sketch follows the list):
- Turn duration: time between start and stop (task completion, an agent question, or a human interruption). It serves as a proxy for practical autonomy.
- Extreme percentiles (99.9th) of turn duration, to capture very ambitious uses.
- Risk and autonomy scores from 1 to 10, assigned by Claude to each tool call and used to group actions into clusters.
- Indicators of human presence and safeguards (permissions, manual approvals, irreversible actions).
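To make these metrics concrete, here is a minimal sketch of how turn duration, its tail percentile, interruption rates, and a risk-score share could be computed from session logs. The Turn fields, stop-reason labels, and the high-risk cutoff are illustrative assumptions, not Anthropic’s actual schema.

```python
# A minimal sketch, assuming a hypothetical session log format; field names
# (started_at, ended_at, stop_reason, risk_score) are illustrative only.
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List

@dataclass
class Turn:
    started_at: datetime
    ended_at: datetime
    stop_reason: str      # "completed" | "agent_question" | "human_interrupt"
    risk_score: int       # 1-10, assigned by an upstream classifier

def turn_duration_seconds(turn: Turn) -> float:
    """Proxy for practical autonomy: wall-clock time the agent ran unattended."""
    return (turn.ended_at - turn.started_at).total_seconds()

def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile (p in [0, 100]); good enough for a rough tail estimate."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

def summarize(turns: List[Turn]) -> dict:
    assert turns, "expects at least one turn"
    durations = [turn_duration_seconds(t) for t in turns]
    return {
        "median_s": median(durations),
        "p99_9_s": percentile(durations, 99.9),
        "interrupt_rate": sum(t.stop_reason == "human_interrupt" for t in turns) / len(turns),
        "high_risk_share": sum(t.risk_score >= 8 for t in turns) / len(turns),
    }
```

The nearest-rank percentile keeps the sketch dependency-free; in practice any standard percentile implementation would do, since the point is simply that the median and the 99.9th percentile tell very different stories about the same sessions.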
Important: these measures describe what happens in practice, as opposed to controlled capability evaluations (for example, METR’s). The two aren’t interchangeable, but they complement each other.
Key findings
- The long tail of autonomy is growing. In Claude Code, the longest turns (99.9th percentile) nearly doubled between Oct 2025 and Jan 2026: from ~25 minutes to ~45 minutes. The median remains short (~45 seconds), but that tail suggests increasingly ambitious tasks.
- Experienced users trust more, but watch closely. The share of sessions with full auto-approve rises from ~20% for new users to >40% for users with hundreds of sessions. At the same time, interruption rates per turn increase (from ~5% for novices to ~9% for experienced users). Contradiction? Not really: users move from step-by-step approvals to monitoring and correcting when needed.
- The agent also supervises. Claude Code asks clarifying questions more often on complex tasks, and in those cases it stops to ask more often than humans interrupt. Teaching a model to recognize its uncertainty and stop is a form of safety.
- Risk is present but concentrated. In public traffic, most tool calls are low-risk and reversible. Quantitatively: ~80% of calls come from agents with at least one safeguard, ~73% show some form of human-in-the-loop (probably an upper bound), and only ~0.8% look irreversible (for example, sending a real email). A sketch of how such shares could be tallied follows this list.
- Initial domain: software. Almost 50% of agentic activity is software engineering. Early experiments appear in health, finance, and cybersecurity, but not yet at scale.
- Some actions are at the edge. Clusters with high autonomy and/or high risk exist (automatic trade execution, access to medical records, intrusion testing), although many could be evaluations or red-teaming rather than production operations.
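As a rough illustration, the safeguard, human-in-the-loop, and irreversibility shares above amount to simple proportions over labeled tool calls. The boolean flags used here (has_safeguard, human_in_loop, irreversible) are hypothetical stand-ins for the classifier outputs the report describes.

```python
# A minimal sketch of tallying safeguard/reversibility shares over labeled
# tool calls. The flag names are hypothetical, not Anthropic's schema.
from typing import Iterable, Mapping

def safeguard_shares(tool_calls: Iterable[Mapping[str, bool]]) -> dict:
    calls = list(tool_calls)
    n = len(calls) or 1  # avoid division by zero on an empty batch
    return {
        "any_safeguard": sum(c["has_safeguard"] for c in calls) / n,
        "human_in_loop": sum(c["human_in_loop"] for c in calls) / n,  # likely an upper bound
        "irreversible": sum(c["irreversible"] for c in calls) / n,
    }

# Example: three calls, one of them irreversible (e.g. actually sending an email).
print(safeguard_shares([
    {"has_safeguard": True,  "human_in_loop": True,  "irreversible": False},
    {"has_safeguard": True,  "human_in_loop": False, "irreversible": False},
    {"has_safeguard": False, "human_in_loop": False, "irreversible": True},
]))
```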
Comparison with capability evaluations
Capability evaluations like METR’s measure what a model can do under ideal conditions. This study measures what agents do in real deployments: pauses, questions, human interruptions, and safeguards all reduce practical latitude. Result: technical capability exceeds the autonomy exercised in current deployments, but the gap is shrinking in the tail of ambitious uses.
Important limitations (be critical of the numbers)
- Only Anthropic data: other platforms may show different patterns.
- Two complementary but imperfect sources: the public API over-represents workflows with many tool calls; Claude Code is biased toward software.
- Classifications and scores were generated by Claude, and there was no direct human inspection of each item for privacy reasons.
- Time window: late 2025 to early 2026. The landscape changes fast.
- No robust way to join independent API requests into coherent sessions, which limits inferences about long flows.
Practical recommendations (for developers and regulators)
- Invest in post-deployment monitoring. What happens in production can diverge a lot from prior evaluations; you need infrastructure that tracks how tools are used over time and in context, preferably while preserving privacy.
- Train models to recognize and communicate uncertainty. Having the agent ask before acting is active safety; it’s a trainable and practical property.
- Design interfaces for effective supervision, not for ritual compliance. Forcing manual approval for every action can create friction without improving safety; better to provide reliable visibility and simple mechanisms to intervene (see the gating sketch after this list).
- Avoid rigid mandates about interaction patterns. More useful than demanding permanent approvals is ensuring humans are positioned to monitor and stop the agent when it matters.
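One way to read these recommendations together is as a gating policy: low-risk, reversible tool calls pass through automatically and get logged for review, while irreversible or high-risk calls pause for an explicit human decision. The sketch below assumes hypothetical ToolCall fields and a made-up risk threshold; it is not a prescribed interface.

```python
# A minimal sketch of a tool-call gating policy: auto-approve low-risk,
# reversible calls; escalate irreversible or high-risk calls to a human.
# Field names and the threshold are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    name: str
    risk_score: int        # 1-10, from an upstream classifier
    irreversible: bool     # e.g. sends an email, moves money

def gate(call: ToolCall, ask_human: Callable[[ToolCall], bool], risk_threshold: int = 7) -> bool:
    """Return True if the call may proceed."""
    if call.irreversible or call.risk_score >= risk_threshold:
        # Escalate: the human sees the call and explicitly approves or rejects it.
        return ask_human(call)
    # Low-risk and reversible: let it run, and log it for after-the-fact review.
    return True

# Example: a reviewer who rejects everything blocks the irreversible email send.
allowed = gate(
    ToolCall(name="send_email", risk_score=6, irreversible=True),
    ask_human=lambda c: False,
)
print(allowed)  # False
```

The point of the design is that the human is only pulled in when it matters, which matches the observed shift from step-by-step approvals to monitoring plus targeted intervention.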
Final reflection
Autonomy in agents isn’t only a property of the model: it’s co-constructed by the model, the user, and the product. Anthropic’s numbers show practice still trails capability, but adoption and task complexity are rising fast. If we design human oversight, post-deployment monitoring, and models that know when to ask for help, we can scale useful agents without giving up safety.
