Language models remain attractive targets for jailbreaks: techniques that try to bypass protections and extract dangerous information. Anthropic presents a new generation of Constitutional Classifiers, called Constitutional Classifiers++, that promises better robustness, fewer refusals of harmless requests, and a nearly negligible compute cost.
What changed and why it matters
The first generation of Constitutional Classifiers already achieved a notable reduction: the jailbreak success rate fell from 86% on an unprotected model to 4.4%. But that came with two real costs: a 23.7% increase in compute and a 0.38% rise in the rejection rate for innocuous questions.
Constitutional Classifiers++ aims to keep (and improve) the protection against attacks without those costs. The result: better robustness, a rejection rate for benign queries of only 0.05% on real Claude Sonnet 4.5 traffic (an 87% drop versus the first version), and a compute overhead of roughly 1% when applied to Claude Opus 4.0.
The key takeaway: more protection, less friction for users, and almost no overhead in production.
How it works technically (clear, practical version)
The core idea is to combine several techniques into an ensemble. The important pieces are an exchange classifier, a cascade architecture, and linear probes that reuse the model’s internal activations.
Two-stage architecture and exchange classification
Previously, classifiers checked inputs and outputs separately. That left a gap: attacks that hide the harmful connection between input and output could slip through. The improvement is an exchange classifier that evaluates the output in the context of the input, which catches jailbreaks that rely on that relationship far more reliably.
In tests with human red teams, that approach cut successful attacks by more than half. The downside: it increased compute cost by 50%, which forced a further mitigation.
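To make the distinction concrete, here is a minimal sketch of the difference between classifying input and output separately and classifying the full exchange. The `score` callable stands in for any harmfulness classifier; the function names, thresholds, and prompt format are illustrative assumptions, not Anthropic's API.

```python
# Sketch: separate input/output classification vs. an exchange classifier.
# `score` is a placeholder for any harmfulness classifier returning a value
# in [0, 1]; nothing here reflects Anthropic's actual interfaces.

def classify_separately(score, user_input: str, model_output: str,
                        threshold: float = 0.5) -> bool:
    # Each side is judged on its own, so a benign-looking prompt and a
    # benign-looking answer can both pass even if, together, they leak harm.
    return score(user_input) > threshold or score(model_output) > threshold


def classify_exchange(score, user_input: str, model_output: str,
                      threshold: float = 0.5) -> bool:
    # The output is judged in the context of the input, so a harmful
    # relationship between the two becomes visible to the classifier.
    exchange = f"USER: {user_input}\nASSISTANT: {model_output}"
    return score(exchange) > threshold
```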
Cascade architecture: cheap until you need to spend more
To control cost, they implemented a cascade architecture: a lightweight, cheap classifier reviews all traffic, and only the conversations it flags move on to a more powerful (and more expensive) classifier. The advantages are clear:
The lightweight classifier can tolerate more false positives, because a flag doesn’t stop the conversation; it only escalates it to the heavier classifier.
Most traffic is processed cheaply, reducing overall overhead.
This idea keeps the robustness of the exchange classifier without multiplying operational cost.
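In code, the cascade is little more than an early exit. The sketch below assumes a cheap first-stage scorer and an expensive second-stage scorer (both hypothetical callables); the thresholds are illustrative.

```python
# Sketch of a two-stage cascade: a cheap classifier screens all traffic and
# only flagged exchanges reach the expensive classifier. Thresholds and
# function names are illustrative assumptions.

def cascade_decision(cheap_score, expensive_score, exchange: str,
                     stage1_threshold: float = 0.2,
                     stage2_threshold: float = 0.5) -> bool:
    """Return True if the exchange should be blocked."""
    # Stage 1: cheap screen over everything. A low threshold tolerates false
    # positives here, because a hit only escalates; it never blocks directly.
    if cheap_score(exchange) < stage1_threshold:
        return False  # the vast majority of traffic stops here, cheaply

    # Stage 2: the expensive classifier makes the actual blocking decision.
    return expensive_score(exchange) > stage2_threshold
```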
Internal probes: leverage what the model already computes
Here comes the most interesting technical part: when a model generates text, it produces internal activations at every step. Anthropic trained linear probes (simple linear classifiers) that read those activations to pick up early signals of potentially harmful content.
Why does this matter?
It’s cheap because it reuses compute the neural network already performed.
It’s harder to fool: manipulating the final output is easier than manipulating the model’s internal activation structure.
It complements external classifiers: probes see patterns that traditional classifiers don’t, and vice versa.
The final production system uses a linear probe to screen all traffic and, when needed, escalates to an ensemble of probes and external classifiers for the final decision.
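The probe itself is conceptually simple. The sketch below assumes access to the model's per-token hidden activations (shape `[tokens, hidden_dim]`); the mean pooling, the dimensions, and the class name are illustrative choices, not Anthropic's implementation.

```python
import numpy as np

class LinearProbe:
    """A linear classifier over hidden activations the model already computed."""

    def __init__(self, weights: np.ndarray, bias: float):
        self.w = weights  # learned from labeled (activation, harmful?) pairs
        self.b = bias

    def score(self, hidden_states: np.ndarray) -> float:
        # Pool the per-token activations (mean pooling for simplicity), then
        # apply a single linear layer plus a sigmoid. The extra compute is
        # trivial compared to generation: the activations already exist.
        pooled = hidden_states.mean(axis=0)
        logit = float(pooled @ self.w + self.b)
        return 1.0 / (1.0 + np.exp(-logit))


# Example usage with placeholder values:
hidden_dim = 4096
probe = LinearProbe(weights=np.zeros(hidden_dim), bias=0.0)
activations = np.random.randn(16, hidden_dim)  # 16 generated tokens
harm_probability = probe.score(activations)    # ~0.5 with zero weights
```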
Detected vulnerabilities and how they address them
Anthropic documents two attack categories that caused trouble before:
Reconstruction attacks: dangerous information is split into benign-looking pieces that are reassembled later, for example by hiding instructions across scattered functions in a code snippet and asking the model to extract the hidden message.
Output obfuscation attacks: the attacker coaxes the model into disguising dangerous outputs with euphemisms or metaphors (for example, referring to chemical reagents as “food flavorings”).
Those techniques also degrade model capabilities. In the GPQA Diamond benchmark (PhD-level concepts in biology, chemistry, and physics), some jailbreak strategies dropped model performance from 74% down to 32%.
Anthropic’s solution: use the exchange classifier to catch links between input and output, and combine it with internal probes to detect signals the attacker can’t easily hide.
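In sketch form, the combination might look like the snippet below; the rule for merging the two signals (taking the maximum) is an illustrative assumption, not Anthropic's actual rule.

```python
# Sketch: combine the exchange-classifier score with the internal-probe score.
# An attacker now has to defeat both at once: obfuscating the text of the
# output does not erase the internal activations the probe reads.

def combined_block(exchange_score: float, probe_score: float,
                   threshold: float = 0.5) -> bool:
    return max(exchange_score, probe_score) > threshold
```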
Red‑teaming results and production metrics
Anthropic accumulated 1,700 hours of red‑teaming across 198,000 attempts. The numbers speak for themselves:
Only one high‑risk vulnerability was found in that period.
The detection rate is 0.005 per thousand queries, the lowest among techniques evaluated.
No universal jailbreak has been discovered to date.
Additionally, the rejection rate for benign queries was 0.05% on Claude Sonnet 4.5 traffic, and the total compute overhead was about 1% when applied to Claude Opus 4.0.
What's next? Promising research directions
Anthropic suggests several directions to improve further:
Integrate classifier signals directly into the generation process (decoding) to prevent problematic outputs at the source (a rough sketch follows this list).
Train models to better resist obfuscation and reconstruction techniques; in other words, teach the model not to cooperate even when the request is veiled.
Automate red‑teaming to generate adversarial data more systematically and thus improve classifier training.
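As a rough illustration of the first direction, a decoding loop could consult a probe score on the partial generation and stop (or back off) before a harmful completion is emitted. Everything in this sketch, from the sampling interface to the veto rule, is an assumption made for illustration, not a description of Anthropic's system.

```python
# Sketch: classifier-guided decoding. `sample_next_token` and
# `probe_score_so_far` are hypothetical callables standing in for the model's
# sampler and a safety probe over the partial generation.

def guided_generate(sample_next_token, probe_score_so_far, prompt: str,
                    max_tokens: int = 256, veto_threshold: float = 0.9) -> str:
    tokens: list[str] = []
    for _ in range(max_tokens):
        candidate = sample_next_token(prompt, tokens)
        # Check the safety signal on the would-be continuation; if it is
        # strong, stop before the problematic token is ever emitted.
        if probe_score_so_far(prompt, tokens + [candidate]) > veto_threshold:
            return "".join(tokens) + "\n[generation stopped by safety signal]"
        tokens.append(candidate)
    return "".join(tokens)
```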
Does this sound like a patch or like real evolution? Probably both: pragmatic fixes for known issues, but also steps toward more integrated designs in which the model and its defenses are trained to cooperate.
Practical perspective
If you work with LLMs in product or security, this matters because it shows a viable path to improve safety without skyrocketing costs or degrading user experience. Reusing internal activations with linear probes and a cascade architecture is an elegant engineering solution.
If you’re a researcher, this focuses attention on two open challenges: designing robust probes and finding ways to integrate safety signals into generation without harming model usefulness.
In short: the next generation of constitutional classifiers significantly reduces the attack surface, lowers friction for honest users, and makes deployment sustainable. It’s not the final word on LLM security, but it’s a concrete, practical advance.