Claude Opus 4.5 marks a meaningful advance in resisting prompt injections when agents act inside the browser. Anthropic says the Claude extension for Chrome is moving from research preview to beta for Max plan users, but they also remind us the problem is still active and the web remains an adversarial environment that needs ongoing work.
What prompt injection is and why the browser is so vulnerable
Can you imagine asking an agent to read your emails and, without you noticing, one message instructs it to send information to a third party? That is prompt injection: adversarial instructions hidden inside the content the model processes.
In the browser the risk grows for two clear reasons. First, the attack surface is huge: pages, embedded documents, ads, dynamic scripts, forms. Second, browser agents perform many actions — navigate, fill forms, click, download files — that an attacker can try to manipulate if they influence the agent's behavior.
A concrete example: an email with white-colored text or a manipulated image that contains hidden instructions. You don't see it, the agent does, and if the agent misses the trap it can leak data or take unwanted steps.
What Anthropic achieved with Claude Opus 4.5
Anthropic reports improvements on two fronts: model robustness and operational safeguards. They tested the extension against an adaptive internal attacker, Best-of-N, that combines known prompt injection techniques. The new model, Claude Opus 4.5, significantly reduces the success rate of those attacks. Anthropic shows results where a 1% success rate is a notable improvement, but not a final fix.
A 1 percent success rate still represents a real risk in high-value scenarios.
The work focused on these areas:
-
Robustness-oriented training. They use reinforcement learning so Claude learns to identify and refuse malicious instructions in simulated web content. The model is rewarded when it detects and refuses adversarial instructions, even if they seem authoritative or urgent.
-
Classifiers before and during inference. Any untrusted content that enters the
context windowis scanned by classifiers looking for injection patterns: hidden text, manipulated images, deceptive UI elements. When risk is detected, the model's behavior is adjusted by an intervention policy. -
Human red teaming at scale. Internal security researchers and external Arena-style exercises help uncover creative vectors that automated systems miss. Human work is key to finding new attack techniques.
-
Product improvements. The Chrome extension incorporates these safeguards and controls; it's now in beta for Max users, which enables broader testing and real feedback.
Important technical aspects
-
Using
RLfor robustness means exposing the model to simulated attacks during training and optimizing rewards for safe behavior. -
Classifiers act as a pre-defense layer: they detect anomalies in input and can trigger an intervention that changes the instruction reaching the main model.
-
Evaluation with a
Best-of-Nattacker reflects an adaptive strategy that tries multiple techniques and combines the ones that work — a more realistic way to measure resilience. -
The attack success rate metric is useful, but it must be read in context: 1% may be acceptable for low-risk tasks and dangerous for operations that handle secrets or financial actions.
Practical recommendations for deploying agents in browsers
If you're going to integrate agents that browse or interact with the web, what measures should you take right now?
-
Limit the agent's scope and privileges: apply least privilege. Don't grant more permissions than strictly necessary.
-
Sanitize and prefilter content before processing it with the model. Use classifiers to block or flag suspicious inputs.
-
Implement human confirmations for sensitive actions: data exfiltration, transfers, publishing information.
-
Keep audit trails and logs of agent actions: who requested what, which pages were visited and what decisions were made.
-
Run red teaming and adversarial tests regularly. Combine automated tools with human teams to find creative vectors.
-
Update models and classifiers frequently. Attack techniques evolve and defenses must too.
-
Consider operational limits: number of automated clicks, forms completed, downloads allowed.
The road ahead
Anthropic's research shows real progress: training models to resist adversarial instructions, pairing them with classifiers and subjecting them to red teaming produces security improvements. Still, the web is an adversarial space and there is no silver bullet.
Does that mean you shouldn't use browser agents? No. It means you must design, control, and monitor them. Advances like Claude Opus 4.5 are important steps, but the responsibility for safe deployment falls on product, security, and operations teams.
If you work on this, participating in robustness exercises and collaborating with research teams is one of the most effective ways to improve collective security.
Source
https://www.anthropic.com/research/prompt-injection-defenses
