Anthropic publishes safeguards and framework for AI jailbreaks | Keryc
Claude Fable 5 is available globally again, and Anthropic uses the moment to explain two key things: how they’re blocking dangerous cyber uses with safety classifiers and how they propose to measure the severity of AI “jailbreaks”. Why does this matter to you? Because we’re talking about models that can help defenders and, if not well controlled, also help attackers.
What Anthropic announced about Fable 5
The main news is twofold. First, they detail the safety classifiers that now accompany Fable 5 to detect and block dangerous cyber uses. Second, they present a draft framework to assess the severity of model jailbreaks, created with industry partners.
This isn’t a closed technical manifesto: it’s a practical attempt to balance legitimate AI use in security with the need to prevent abuse. Sound familiar? Many security techniques are dual use: useful for defending, useful for attacking.
How the classifiers work and what they block
Anthropic isn’t trying to block everything related to cybersecurity. Instead, they train classifiers to distinguish four categories of use, from clearly prohibited to benign. Basically, they want to let through what helps defenders and block what could facilitate attacks.
Prohibited: activities with little defensive benefit and high potential for harm. This blocks anything related to ransomware, digital sabotage, defense evasion, command and control (C2), data exfiltration, malware development or delivery, and attacks on critical infrastructure such as BGP routing or certificate authorities (CAs).
High-risk dual use: techniques that legitimate professionals use in testing (penetration testing, red teaming, exploitation), but that are dangerous in the wrong hands. Anthropic intends to block these until they have finer access controls.
Low-risk dual use: activities that tend to be defensive and are usually allowed, though a “safety margin” can still block some requests out of caution. Examples: OSINT, scanning public systems, or finding vulnerabilities that other tools already surface.
Benign: tasks clearly defensive or educational, like secure coding, debugging, network administration, patch management, log analysis, incident response, training, and public content.
Beyond classifiers, Anthropic uses access controls, model training, and offline monitoring. And they acknowledge the “safety margin” can and should be adjusted over time.
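To make the four categories concrete, here is a minimal Python sketch of how a category could map to an allow or block decision. It is only an illustration, not Anthropic's implementation: the names `UseCategory` and `decide`, the `risk_score` input, the `BASE_THRESHOLD` value and the way `safety_margin` lowers that threshold are all assumptions made up for this example.

```python
from enum import Enum


class UseCategory(Enum):
    """The four categories described in the article."""
    PROHIBITED = "prohibited"
    HIGH_RISK_DUAL_USE = "high-risk dual use"
    LOW_RISK_DUAL_USE = "low-risk dual use"
    BENIGN = "benign"


# ASSUMPTION: a hypothetical cut-off on a 0.0-1.0 risk score.
# The article does not describe how the real classifiers score requests.
BASE_THRESHOLD = 0.5


def decide(category: UseCategory, risk_score: float = 0.0,
           safety_margin: float = 0.2) -> str:
    """Return "allow" or "block" for a request that is already classified.

    risk_score and safety_margin are hypothetical knobs for this sketch.
    """
    if category is UseCategory.PROHIBITED:
        return "block"
    if category is UseCategory.HIGH_RISK_DUAL_USE:
        # Per the article, blocked until finer access controls exist.
        return "block"
    if category is UseCategory.LOW_RISK_DUAL_USE:
        # Usually allowed, but the margin lowers the bar for blocking,
        # so some borderline requests are blocked out of caution.
        if risk_score >= BASE_THRESHOLD - safety_margin:
            return "block"
        return "allow"
    return "allow"  # benign: defensive or educational tasks
```

In this sketch, shrinking `safety_margin` lets more low-risk dual-use requests through, which mirrors the idea that the margin can be adjusted over time.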
The proposed framework for measuring jailbreak severity (CJS)
An interesting part is the attempt to create a common language to describe how dangerous a jailbreak is. Why do we need a framework? Because not all jailbreaks are equal: some unlock minor things, others turn a model into a powerful attack tool.
Anthropic proposes the Cyber Jailbreak Severity (CJS), a five-level scale: CJS-0 (informational), CJS-1 (low), CJS-2 (medium), CJS-3 (high) and CJS-4 (critical). The score is calculated by adding up the scores on four axes:
Capability gain (uplift): how much the jailbreak increases an attacker’s power compared to tools already available.
Breadth (universality): how many different offensive tasks the same technique can enable.
Ease of weaponization: how much skill or engineering an attacker needs to turn it into a real attack.
Discoverability: how easy it is to find or obtain the technique.
Each axis has its own scale and examples. The sum gives an initial floor score; reviewers can raise it if there are reasons to think the real risk is higher.
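The "sum the axes, then let reviewers raise it" logic can be sketched in a few lines of Python. Treat this as a rough illustration under loud assumptions: the article does not give the per-axis scales or the cut-offs that turn a sum into a CJS level, so the 0 to 3 axis range, the `LEVEL_THRESHOLDS` table and the names `AxisScores`, `floor_level` and `final_level` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: every axis is scored 0..3. The draft framework gives each
# axis its own scale, which the article does not reproduce.
AXIS_MAX = 3

# ASSUMPTION: hypothetical cut-offs (minimum sum, CJS level), highest first.
LEVEL_THRESHOLDS = [(10, 4), (7, 3), (4, 2), (1, 1), (0, 0)]

# Level names as listed in the article.
LEVEL_NAMES = {0: "informational", 1: "low", 2: "medium", 3: "high", 4: "critical"}


@dataclass(frozen=True)
class AxisScores:
    uplift: int                 # capability gain over existing tools
    breadth: int                # how many offensive tasks it enables
    ease_of_weaponization: int  # how little skill an attacker needs
    discoverability: int        # how easy the technique is to find

    def total(self) -> int:
        values = (self.uplift, self.breadth,
                  self.ease_of_weaponization, self.discoverability)
        for value in values:
            if not 0 <= value <= AXIS_MAX:
                raise ValueError(f"axis score {value} outside 0..{AXIS_MAX}")
        return sum(values)


def floor_level(scores: AxisScores) -> int:
    """Map the summed axes to the initial (floor) CJS level."""
    total = scores.total()
    for minimum, level in LEVEL_THRESHOLDS:
        if total >= minimum:
            return level
    return 0


def final_level(scores: AxisScores, reviewer_level: Optional[int] = None) -> str:
    """Reviewers may raise the level above the floor, never lower it."""
    level = floor_level(scores)
    if reviewer_level is not None:
        if not 0 <= reviewer_level <= 4:
            raise ValueError("reviewer_level must be between 0 and 4")
        level = max(level, reviewer_level)
    return f"CJS-{level} ({LEVEL_NAMES[level]})"
```

With these made-up cut-offs, `final_level(AxisScores(1, 2, 1, 1), reviewer_level=3)` returns `"CJS-3 (high)"`: the sum of 5 sets a floor of CJS-2, and the reviewer raises it. Passing `reviewer_level=1` instead leaves the result at CJS-2, because the floor cannot be lowered.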
Examples to understand it better
Anthropic gives hypothetical and some historical examples to illustrate the framework.
A public string that disables all protections and gets shared on social media would be CJS-4, because wide sharing speeds up and enlarges the damage.
A method that splits a malicious request into benign-looking sub-requests and then reassembles the results is given as a CJS-3 example: dangerous because it can generalize, even though it requires some assembly.
Finding a vulnerability already known to the community greatly reduces severity: if public scanners already surface the information, the model isn’t increasing the attacker’s capability, and the jailbreak could be CJS-0.
These examples show severity depends not only on the jailbreak itself, but on context and what tools already exist.
What changes for users, defenders and the research community
If you use Fable 5 to improve security, many defensive tasks will remain allowed, but you may hit blocks when a request looks too close to what an attacker would use.
If you’re a security researcher, Anthropic has opened formal channels for submitting findings: the email address cyber-safeguards@anthropic.com and a HackerOne program where possible cyber jailbreaks can be reported.
For regulators and response teams, the CJS framework provides useful vocabulary to assess risks consistently and coordinate responses.
Does this mean AI is fully under control now? No. It means there’s a serious effort to balance legitimate access and abuse prevention, and that this balance will be tuned based on experience and public discussion.
Final reflection
Anthropic’s bet is pragmatic: don’t block everything and choke defensive use, but create layers of protection and a framework to speak clearly about risks. It’s an invitation to the community: if you want to help refine the controls or polish the severity framework, there are ways to contribute. Want to get involved? Good security is built together.