Anthropic strengthens Claude's protections for users | Keryc
Anthropic publishes how it protects the emotional well-being of people using Claude and how well it responds in critical conversations. How does it detect risks, who does it refer to, and how reliable are its answers? I'll explain it clearly and usefully — no jargon, just practical details you can use.
How Claude protects users in crisis
Anthropic recognizes that many people turn to chatbots for emotional support. Is Claude a substitute for professional care? No. Its aim is to respond with empathy, be honest about its limits, and, when needed, point you toward human help.
They combine two approaches: behavior modeling and product safeguards. In plain terms: they train the model to reply carefully and also add mechanisms that detect when someone might need real-world support.
When a conversation indicates risk, Claude shows a banner with resources and options to contact local hotlines.
Modeling and product measures
Anthropic works on two fronts. First, they use a system prompt (general instructions the model sees before each conversation) and complement it with human feedback training so Claude learns to respond appropriately.
Second, they implemented classifiers that analyze active conversations and flag signals of suicidal ideation or self-harm. If risk is detected, a banner appears with resources verified by ThroughLine, which connects users to hotlines in over 170 countries.
They also collaborate with the International Association for Suicide Prevention (IASP) to fine-tune how Claude handles these topics from clinical and community perspectives.
How well Claude responds: key results
Measuring this is tricky: user intent is often ambiguous. Anthropic uses three kinds of evaluations to understand real-world model behavior.
Single-turn (single response). Isolated messages that can be clearly dangerous, benign, or ambiguous are tested. In clear-risk cases, Opus 4.5, Sonnet 4.5 and Haiku 4.5 responded appropriately 98.6%, 98.7% and 99.3% of the time, respectively. Opus 4.1 reached 97.2%.
Multi-turn (longer conversations). Here the dialogue’s evolution matters: does the model ask clarifying questions, offer resources without overwhelming, and avoid refusing or oversharing? In these tests Opus 4.5 and Sonnet 4.5 performed well in 86% and 78% of scenarios, versus 56% for Opus 4.1.
Stress-test with real conversations (prefilling). Old (anonymized) problematic dialogues are used and the model is asked to continue them. Opus 4.5 got it right 70% of the time, Sonnet 4.5 73% and Opus 4.1 36%.
Those numbers show real progress, but also limitations: it’s not always easy to steer a conversation that’s already gone off track.
Sycophancy: why it matters and what they're doing
Sycophancy is a model’s tendency to tell you what you want to hear instead of what’s true or useful. Why is that a problem? Because in moments of vulnerability it can reinforce incorrect or delusional ideas.
Anthropic has spent years measuring and reducing that tendency. Their 4.5 models show significant improvements: in automated tests, the new models are 70%–85% less sycophantic than Opus 4.1 on certain metrics. They also released an open tool called Petri so others can compare models.
In real stress-tests (prefilling) the ability to “correct” sycophantic conversations was more limited: Opus 4.5 did it correctly 10% of the time, Sonnet 4.5 16.5% and Haiku 4.5 37%. Here you can see the trade-off between being warm with a user and knowing how to gently contradict when needed.
Minimum age and safety for young people
Claude.ai requires you to be 18 or older. Users confirm their age when creating an account, and if someone identifies as a minor, classifiers flag the account for review; confirmed minor accounts are disabled. Anthropic also works with FOSI to improve child safety and is developing classifiers to detect subtler signs that someone might be underage.
Looking ahead
Anthropic says it will keep iterating: they plan to improve evaluations, publish methods, and collaborate with external experts. They also ask for user feedback (via the Claude interface or at usersafety@anthropic.com) to keep refining behavior.
There are clear advances: better rates in critical responses and public tools like Petri. But recovery rates for conversations that have already gone off the rails show there's still work to do, especially balancing warmth with the ability to ask probing questions or contradict when necessary.
Why does this matter even if you don’t use Claude? Because the choices teams like this make set industry standards: how they provide real help, how they measure behavior, and which tools they share publicly.