Claude Code launches Auto Mode to safely skip permissions | Keryc
Claude Code introduces Auto Mode, a technical option that looks for a middle ground between asking for human approval at every step and completely turning off protections. The goal? Reduce approval fatigue without giving away security.
What problem does Auto Mode solve
If you've worked with agents, this will sound familiar: the model asks permission for every command and you end up clicking without really looking. Sound like a pattern you've seen? That approval fatigue leads to human errors and slow workflows. The extreme alternative, --dangerously-skip-permissions, removes prompts but leaves the system exposed.
Auto Mode offers a third path: delegate approvals to model-based classifiers. The idea is simple: block what's dangerous and let routine stuff pass — all without overwhelming you.
Defense architecture: two layers
Auto Mode runs two complementary layers: an input layer and an output layer.
Input layer: a server-side probe that detects prompt injection attempts in tool outputs (file reads, web fetches, shell output, external responses). If content looks malicious, the probe adds a warning to the agent's context so the agent treats that info as suspicious and sticks to what you asked.
Output layer: a transcript classifier that runs on Sonnet 4.6 and evaluates each action before execution. It works in two stages: a fast single-token filter (yes/no) and, only if that filter flags the action, a chain-of-thought style reasoning stage to reduce false positives.
This combo aims to let most actions pass with no reasoning latency, and only spend costly reasoning when it's necessary.
Rules by tiers: what reaches the classifier
Auto Mode applies an allowlist and tiered rules to minimize the classifier's scope and focus it where it matters:
Tier 1: Allowlist of safe tools and user settings. Includes only read-only operations, searches, code browsing, and actions that don't change state. When Auto Mode is enabled, rules that allow arbitrary execution are removed (broad shell access, interpreters with wildcards, package manager commands).
Tier 2: File operations inside the project directory. Routine writes and edits within the repo still bypass the classifier; it's assumed version control provides auditing.
Tier 3: Everything else. Shells, web fetches, external integrations, subagent spawns, and operations outside the project directory go to the transcript classifier.
Only actions with real risk reach the classifier, which reduces cost and latency.
How the classifier decides
The classifier prompt is a fixed template with three customizable slots that define the user's policy (what's trusted, what to block, exceptions). Decisions follow three principles:
Evaluate real impact: the classifier looks at the effect of the action, not just the text. If the agent writes a payload and then executes it, the classifier evaluates the payload. If there are concatenated command strings, it evaluates the whole set.
Read the user's intent: the classifier is conservative. Anything the agent does on its own initiative is unauthorized until you explicitly approve it. "Clean up my branches" doesn't by default authorize a mass deletion.
Define the trust boundary: by default, the current git repo is the only trusted scope. You can extend this (GitHub orgs, buckets, internal APIs). Outside that boundary, the risk of exfiltration exists.
Technical design of the pipeline
Stage 1: fast filter (single-token) instructed to "err on the side of blocking." Its mission is high recall so dangerous actions don't slip through. That produces more false positives.
Stage 2: chain-of-thought reasoning to clean false positives. It only runs on items flagged by stage 1, which amortizes the compute cost.
It's also designed so the classifier doesn't see tool outputs or assistant messages. It only receives user messages and tool calls. That prevents the agent from "chatting" the classifier into approval.
Data and metrics (evaluation)
Tested on three datasets: real traffic, real overeager actions, and synthetic exfiltration. They report FPR (false positive rate) and FNR (false negative rate):
Dataset
Stage 1 (filter)
Pipeline complete (Stage 1 -> Stage 2)
Real traffic n=10,000
8.5% FPR
0.4% FPR
Overeager real n=52
6.6% FNR
17% FNR
Synthetic exfiltration n=1,000
1.8% FNR
5.7% FNR
The candid number to note: 17% FNR on the real overeager actions after the whole pipeline. That means the classifier lets 17% of risky actions slip through because they were aimed at solving your task. Is that acceptable? It depends: it's a big improvement over --dangerously-skip-permissions, but it doesn't replace human review for critical infrastructure.
Defense against prompt injection and chained attacks
Auto Mode doesn't rely on the classifier alone. If malicious content enters via a tool and manages to manipulate the agent, the input probe tries to detect the injection and tags the piece as suspicious before it reaches the agent.
For an attack to succeed end-to-end, it must escape the probe's detection and also trigger a tool call that the classifier judges safe and aligned with the user's intent. Beating both defenses is much harder than bypassing just one.
Also, the classifier runs on both sides of delegating to subagents: when delegating work and when receiving results. That prevents a subagent from getting a task that looks authorized just because orchestration presented it that way.
Handling blocks and user experience
When the classifier blocks, Claude doesn't hang forever. Instead it returns a denial as the tool result with guidance to find a safer path and not to try to circumvent the restriction.
There is a backstop: after 3 consecutive denials or 20 total, the session halts and escalates to a human. In headless mode (claude -p) the process terminates. This rule protects against compromised or overly persistent agents.
False positives are low (0.4% FPR) but not zero. Auto Mode is designed so a false positive allows retries and nudges instead of breaking long flows.
Customization and deployment
The classifier template includes three configurable slots: the trust policy, categories to block, and exceptions. They ship with conservative defaults. You can start with claude auto-mode defaults and then tweak settings instead of rewriting everything.
Practical recommendation: use Auto Mode when compared to --dangerously-skip-permissions or when human approval overload isn't worth the risk. Don't use it as a substitute for human review in high-risk tasks.
Final reflection
Auto Mode doesn't promise perfection; it promises a better balance between autonomy and safety. It's defense in depth: input probe, two-stage transcript classifier, tiered rules, and adjustable policies. For teams that want fewer clicks but still need to protect critical infrastructure, it's a useful tool.
What's next? Improve the overeagerness test set, refine the template, and reduce FNR. Meanwhile, the practical advice is to know Auto Mode's limits, configure your trust boundaries well, and keep human review where risk is unacceptable.