Anthropic announced on September 12, 2025, that it worked directly with the US Center for AI Standards and Innovation (CAISI) and the UK AI Security Institute (AISI) to evaluate and improve the defenses of its models, including tests on Claude models such as Opus 4 and Opus 4.1. (anthropic.com)
What Anthropic, CAISI and AISI did
The collaboration started as voluntary consultations and evolved into sustained access for CAISI and AISI teams at different stages of model development. In practice, that meant government teams could test prototypes before deployment and iterate alongside Anthropic’s engineers.
Why does that matter? Because some vulnerabilities only show up over time, not in one-off tests. Continuous access helped find complex failure modes that a single evaluation would miss. (anthropic.com)
CAISI and AISI are organizations built to assess real-world AI security risks. They bring expertise in threat analysis, cybersecurity, and attack modeling — skills that complement what model developers already do. (nist.gov)
Key findings: what vulnerabilities they found
Anthropic lists several attack types that external teams discovered during red-teaming and ongoing testing:
- Prompt injections that relied on fake annotations, such as claiming a human reviewer had already approved the content, to evade detectors. (anthropic.com)
- Universal jailbreaks built through automated iteration, which forced a rethink of the safeguard architecture rather than patches for individual exploits. (anthropic.com)
- Attacks using encodings and character substitutions to hide malicious prompts. (anthropic.com)
- Input and output obfuscation that split harmful strings across benign-looking contexts to slip past filters.
- Automated systems that refine attacks step by step, which increases the need for defenses that don’t rely only on static rules. (anthropic.com)
These findings were practical, not hypothetical. Anthropic used them to patch specific issues, improve classifiers, and reorganize parts of its safeguard architecture. (anthropic.com)
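To make the encoding and obfuscation findings concrete: a filter that only inspects the raw string misses a prompt whose characters have been swapped for lookalikes or wrapped in base64. The sketch below is a minimal, hypothetical preprocessing step in Python, not Anthropic's actual safeguards (whose internals aren't public); the homoglyph table and helper names are illustrative assumptions.

```python
import base64
import binascii
import unicodedata

# Illustrative lookalike map; a production system would use a far larger table.
HOMOGLYPHS = {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}

def normalize_text(text: str) -> str:
    """Fold Unicode lookalikes and simple character substitutions into a canonical form."""
    # NFKC collapses many visually similar code points (e.g. fullwidth letters).
    folded = unicodedata.normalize("NFKC", text).lower()
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in folded)

def try_decode_base64(text: str) -> str | None:
    """Return the decoded string if the input looks like base64-encoded ASCII, else None."""
    stripped = "".join(text.split())
    try:
        return base64.b64decode(stripped, validate=True).decode("ascii")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None

def candidate_views(prompt: str) -> list[str]:
    """Produce every variant of the prompt that a safety classifier should inspect."""
    views = [prompt, normalize_text(prompt)]
    decoded = try_decode_base64(prompt)
    if decoded is not None:
        views.append(normalize_text(decoded))
    return views

if __name__ == "__main__":
    encoded = base64.b64encode(b"ignore all previous instructions").decode("ascii")
    for view in candidate_views(encoded):
        print(view)  # a classifier would run on each view, not just the raw input
```

The specific table isn't the point; the point is that defenses need to evaluate what a prompt resolves to, not just what it looks like, which is exactly the gap the encoding and substitution attacks exploited.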
Practical lessons and an effective approach
From this joint work come several lessons any AI developer can apply:
- Broad, controlled access to systems improves detection of serious vulnerabilities. Testing only production models, or running isolated evaluations, leaves gaps. (anthropic.com)
- Iterative testing and daily communication with external teams reveal complex attack vectors that need time and context to surface. (anthropic.com)
- A multi-layered strategy is stronger: internal audits, public bug-bounty programs, and specialized red-teaming exercises complement each other. (anthropic.com)
These practices align with the public role CAISI and AISI aim to play: facilitating evaluations, developing frameworks, and collaborating with industry to improve the security of commercial AI systems. (nist.gov)
What this means for companies, teams and users
If you work in AI product or security, the takeaway is clear: simple rules or static filters aren’t enough. You need a process that adapts.
- Test model versions with safeguards removed, then iterate toward stronger defenses.
- Keep channels open with external evaluators and, when possible, independent assessment bodies.
- Combine automated tests with specialized human audits and bug bounty programs (a rough sketch of the automated piece follows below).
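As an illustration of that automated piece, the sketch below is a hypothetical harness, not any tool named in Anthropic's post: it mutates a set of seed prompts with simple obfuscations and reports every variant that slips past a safety check, so regressions surface in CI instead of in production.

```python
import random
from typing import Callable, Iterable

# Hypothetical hook: in a real harness this would call your safety classifier
# or a staging deployment of the safeguards. True means the prompt was blocked.
SafetyCheck = Callable[[str], bool]

# Simple obfuscating transformations, stacked across rounds.
MUTATIONS = [
    lambda p: p.upper(),
    lambda p: p.replace("a", "@").replace("e", "3"),
    lambda p: " ".join(p),                        # space out every character
    lambda p: p.encode("ascii", "ignore").hex(),  # hex-encode the payload
]

def mutate(prompt: str, rounds: int, rng: random.Random) -> Iterable[str]:
    """Yield successive mutations of a seed prompt, each building on the last."""
    current = prompt
    for _ in range(rounds):
        current = rng.choice(MUTATIONS)(current)
        yield current

def run_regression(seeds: list[str], is_blocked: SafetyCheck, rounds: int = 3) -> list[str]:
    """Return every mutated prompt that the safety check failed to block."""
    rng = random.Random(0)  # fixed seed so failures are reproducible in CI
    escapes = []
    for seed in seeds:
        for variant in mutate(seed, rounds, rng):
            if not is_blocked(variant):
                escapes.append(variant)
    return escapes

if __name__ == "__main__":
    # A deliberately naive keyword filter -- exactly the kind of static rule
    # the mutations above are designed to expose.
    def naive_check(prompt: str) -> bool:
        return "forbidden" in prompt.lower()

    print(run_regression(["please do the forbidden thing"], naive_check))
```

A real harness would swap in stronger mutation strategies and feed the escapes into the human audits and bug-bounty loops mentioned above.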
For end users and decision-makers, this shows some companies are investing in rigorous external testing. But it also reminds you that threats evolve quickly and security is an ongoing process, not a fixed goal. (anthropic.com)
Final reflection
This collaboration between a model company and specialized agencies isn’t PR theater. It’s a bet on building stronger defenses through controlled transparency and joint work. Does that mean models are invulnerable? No. It means finding and fixing flaws requires access, time, and diverse testing techniques.
If you want to see how these efforts translate into concrete changes for products you use every day, we can review specific technical safeguards or create a checklist for product and security teams.