Language models remain attractive targets for jailbreaks: techniques that try to bypass protections and extract dangerous information. Anthropic presents a new generation of Constitutional Classifiers, called Constitutional Classifiers++, that promises better robustness, fewer refusals of harmless requests, and a nearly negligible compute cost.
What changed and why it matters
The first generation of Constitutional Classifiers already achieved a notable reduction: the jailbreak success rate fell from 86% on an unprotected model to 4.4%. But that came with two real costs: a 23.7% increase in compute and a 0.38% rise in the rejection rate for innocuous questions.
Constitutional Classifiers++ aims to keep (and improve) the protection against attacks without those costs. The result: better robustness, a rejection rate for benign queries of only 0.05% on real Claude Sonnet 4.5 traffic (an 87% drop versus the first version), and a compute overhead of roughly 1% when applied to Claude Opus 4.0.
The key takeaway: more protection, less friction for users, and almost no overhead in production.
How it works technically (clear, practical version)
The core idea is to combine several techniques into an ensemble. The important pieces are an exchange classifier, a cascade architecture, and linear probes that reuse the model’s internal activations.
Two-stage architecture and exchange classification
Previously, classifiers checked inputs and outputs separately. That left a gap: attacks that hide the harmful connection between input and output could slip through. The improvement is an exchange classifier that evaluates the output in the context of the input, which catches jailbreaks that rely on that relationship far more reliably.
In tests with human red teams, that approach cut successful attacks by more than half. The downside: it increased compute cost by 50%, which forced a further mitigation.
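To make the distinction concrete, here is a minimal sketch of the difference between classifying input and output separately and classifying the full exchange. The `score` callable stands in for any harmfulness classifier; the function names, thresholds, and prompt format are illustrative assumptions, not Anthropic's API.

```python
# Sketch: separate input/output classification vs. an exchange classifier.
# `score` is a placeholder for any harmfulness classifier returning a value
# in [0, 1]; nothing here reflects Anthropic's actual interfaces.

def classify_separately(score, user_input: str, model_output: str,
                        threshold: float = 0.5) -> bool:
    # Each side is judged on its own, so a benign-looking prompt and a
    # benign-looking answer can both pass even if, together, they leak harm.
    return score(user_input) > threshold or score(model_output) > threshold


def classify_exchange(score, user_input: str, model_output: str,
                      threshold: float = 0.5) -> bool:
    # The output is judged in the context of the input, so a harmful
    # relationship between the two becomes visible to the classifier.
    exchange = f"USER: {user_input}\nASSISTANT: {model_output}"
    return score(exchange) > threshold
```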
Cascade architecture: cheap until you need to spend more
To control cost, they implemented a cascade architecture: a lightweight, cheap classifier reviews all traffic, and only the conversations it flags move on to a more powerful (and more expensive) classifier. The advantages are clear:
The lightweight classifier can tolerate more false positives, because a flag doesn’t stop the conversation; it only escalates it to the heavier classifier.
Most traffic is processed cheaply, reducing overall overhead.
This idea keeps the robustness of the exchange classifier without multiplying operational cost.
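In code, the cascade is little more than an early exit. The sketch below assumes a cheap first-stage scorer and an expensive second-stage scorer (both hypothetical callables); the thresholds are illustrative.

```python
# Sketch of a two-stage cascade: a cheap classifier screens all traffic and
# only flagged exchanges reach the expensive classifier. Thresholds and
# function names are illustrative assumptions.

def cascade_decision(cheap_score, expensive_score, exchange: str,
                     stage1_threshold: float = 0.2,
                     stage2_threshold: float = 0.5) -> bool:
    """Return True if the exchange should be blocked."""
    # Stage 1: cheap screen over everything. A low threshold tolerates false
    # positives here, because a hit only escalates; it never blocks directly.
    if cheap_score(exchange) < stage1_threshold:
        return False  # the vast majority of traffic stops here, cheaply

    # Stage 2: the expensive classifier makes the actual blocking decision.
    return expensive_score(exchange) > stage2_threshold
```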
Internal probes: leverage what the model already computes
Here comes the most interesting technical part: when a model generates text, it produces internal activations at every step. Anthropic trained linear probes (simple linear classifiers) that read those activations to pick up early signals of potentially harmful content.
Why does this matter?
It’s cheap because it reuses compute the neural network already performed.
It’s harder to fool: manipulating the final output is easier than manipulating the model’s internal activation structure.
It complements external classifiers: probes see patterns that traditional classifiers don’t, and vice versa.
The final production system uses a linear probe to screen all traffic and, when needed, escalates to an ensemble of probes and external classifiers for the final decision.
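The probe itself is conceptually simple. The sketch below assumes access to the model's per-token hidden activations (shape `[tokens, hidden_dim]`); the mean pooling, the dimensions, and the class name are illustrative choices, not Anthropic's implementation.

```python
import numpy as np

class LinearProbe:
    """A linear classifier over hidden activations the model already computed."""

    def __init__(self, weights: np.ndarray, bias: float):
        self.w = weights  # learned from labeled (activation, harmful?) pairs
        self.b = bias

    def score(self, hidden_states: np.ndarray) -> float:
        # Pool the per-token activations (mean pooling for simplicity), then
        # apply a single linear layer plus a sigmoid. The extra compute is
        # trivial compared to generation: the activations already exist.
        pooled = hidden_states.mean(axis=0)
        logit = float(pooled @ self.w + self.b)
        return 1.0 / (1.0 + np.exp(-logit))


# Example usage with placeholder values:
hidden_dim = 4096
probe = LinearProbe(weights=np.zeros(hidden_dim), bias=0.0)
activations = np.random.randn(16, hidden_dim)  # 16 generated tokens
harm_probability = probe.score(activations)    # ~0.5 with zero weights
```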
Detected vulnerabilities and how they address them
Anthropic documents two attack categories that caused trouble before:
Reconstruction attacks: dangerous information is split into benign-looking pieces that are reassembled later, for example by hiding instructions across scattered functions in a code snippet and asking the model to extract the hidden message.
Output obfuscation attacks: the attacker coaxes the model into disguising dangerous outputs with euphemisms or metaphors (for example, referring to chemical reagents as “food flavorings”).
Those techniques also degrade model capabilities. In the GPQA Diamond benchmark (PhD-level concepts in biology, chemistry, and physics), some jailbreak strategies dropped model performance from 74% down to 32%.
Anthropic’s solution: use the exchange classifier to catch links between input and output, and combine it with internal probes to detect signals the attacker can’t easily hide.
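In sketch form, the combination might look like the snippet below; the rule for merging the two signals (taking the maximum) is an illustrative assumption, not Anthropic's actual rule.

```python
# Sketch: combine the exchange-classifier score with the internal-probe score.
# An attacker now has to defeat both at once: obfuscating the text of the
# output does not erase the internal activations the probe reads.

def combined_block(exchange_score: float, probe_score: float,
                   threshold: float = 0.5) -> bool:
    return max(exchange_score, probe_score) > threshold
```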
Red‑teaming results and production metrics
Anthropic accumulated 1,700 hours of red‑teaming across 198,000 attempts. The numbers speak for themselves:
Only one high‑risk vulnerability was found in that period.
The detection rate is 0.005 per thousand queries, the lowest among techniques evaluated.
No universal jailbreak has been discovered to date.
Additionally, the rejection rate for benign queries was 0.05% on Claude Sonnet 4.5 traffic, and the total compute overhead was about 1% when applied to Claude Opus 4.0.
What's next? Promising research directions
Anthropic suggests several directions to improve further:
Integrate classifier signals directly into the generation process (decoding) to prevent problematic outputs at the source (a rough sketch follows this list).
Train models to better resist obfuscation and reconstruction techniques; in other words, teach the model not to cooperate even when the request is veiled.
Automate red‑teaming to generate adversarial data more systematically and thus improve classifier training.
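As a rough illustration of the first direction, a decoding loop could consult a probe score on the partial generation and stop (or back off) before a harmful completion is emitted. Everything in this sketch, from the sampling interface to the veto rule, is an assumption made for illustration, not a description of Anthropic's system.

```python
# Sketch: classifier-guided decoding. `sample_next_token` and
# `probe_score_so_far` are hypothetical callables standing in for the model's
# sampler and a safety probe over the partial generation.

def guided_generate(sample_next_token, probe_score_so_far, prompt: str,
                    max_tokens: int = 256, veto_threshold: float = 0.9) -> str:
    tokens: list[str] = []
    for _ in range(max_tokens):
        candidate = sample_next_token(prompt, tokens)
        # Check the safety signal on the would-be continuation; if it is
        # strong, stop before the problematic token is ever emitted.
        if probe_score_so_far(prompt, tokens + [candidate]) > veto_threshold:
            return "".join(tokens) + "\n[generation stopped by safety signal]"
        tokens.append(candidate)
    return "".join(tokens)
```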
Does this sound like a patch or like real evolution? Probably both: pragmatic fixes for known issues, but also steps toward more integrated designs in which the model and its defenses are trained to cooperate.
Practical perspective
If you work with LLMs in product or security, this matters because it shows a viable path to improve safety without skyrocketing costs or degrading user experience. Reusing internal activations with linear probes and a cascade architecture is an elegant engineering solution.
If you’re a researcher, this focuses attention on two open challenges: designing robust probes and finding ways to integrate safety signals into generation without harming model usefulness.
In short: the next generation of constitutional classifiers significantly reduces the attack surface, lowers friction for honest users, and makes deployment sustainable. It’s not the final word on LLM security, but it’s a concrete, practical advance.