Anthropic measures political bias in Claude and publishes evaluation | Keryc
Anthropic shares how it trains and evaluates Claude to be even-handed in politics: to treat opposing views with the same depth, respect and clarity. Here I explain the essentials, why it matters and what results their new automated evaluation produces.
What Anthropic means by even-handedness
The idea is simple: when a conversation touches politics, you want an honest and useful discussion, not one that pushes an opinion. Anthropic defines even-handedness as the model’s ability to treat opposing viewpoints with equal quality of analysis, evidence and tone.
If a model defends one side with three paragraphs and replies to the other with bullets, that's bias, not neutrality.
Anthropic expects Claude to:
Avoid offering unsolicited political opinions.
Maintain factual accuracy and informational breadth.
Be able to give the "best version" of each stance (passing a kind of ).
Ideological Turing Test
Use neutral terminology when possible and represent multiple perspectives.
How they train Claude for that
It’s not just a prompt. They use two main levers:
The system prompt that guides behavior in every conversation. They update it regularly to reinforce these practices.
Character training via reinforcement learning: they reward responses that show traits like objectivity, balance and reluctance to produce rhetoric that could serve as propaganda.
Anthropic shares fragments of those traits (for example: "I will not generate rhetoric that unduly sway political opinions"), and admits it’s an experimental, constantly reviewed process.
How they measured bias: the automated Paired Prompts test
Their central method is Paired Prompts: two prompts that address the same political topic from opposite perspectives. Then they compare the responses across three criteria:
Even-handedness: equivalent depth and quality between both responses.
Opposing perspectives: whether the model includes counterarguments or nuances.
Refusals: whether the model refuses to participate.
What’s new is that they evaluated thousands of pairs with an automated grader (Claude Sonnet 4.5 acted as the grader), and published the methodology and prompts so anyone can reproduce it.
Which models they compared and how they set up the test
They mainly evaluated Claude Opus 4.1 and Claude Sonnet 4.5 (using the Claude.ai system prompt). They also included comparators: GPT-5, Gemini 2.5 Pro, Grok 4 and Llama 4 Maverick, trying to configure conditions as comparable as possible.
They tested 1,350 prompt pairs across 150 topics and 9 task types (arguments, persuasive essays, narratives, analysis, humor, etc.). It’s a broad snapshot but focused mainly on U.S. political discourse.
Key results
Even-handedness (percentage):
Claude Opus 4.1: 95%
Claude Sonnet 4.5: 94%
Gemini 2.5 Pro: 97%
Grok 4: 96%
GPT-5: 89%
Llama 4: 66%
In practical terms, Opus and Sonnet score very high, with Gemini and Grok at similar levels; GPT-5 and Llama 4 show lower even-handedness by this metric.
Opposing perspectives (percentage of responses that acknowledge counterarguments):
They ran validity checks using other models as graders: Sonnet 4.5 agreed with GPT-5 92% of the time and with Opus 4.1 94% in a sample evaluation. By comparison, agreement among human raters was lower (≈ 85%).
They also calculated correlations between global scores: Sonnet vs Opus showed very high correlations (r > 0.99 for even-handedness). Overall, the automated ratings were consistent across models, though not perfect.
Important limitations (what Anthropic acknowledges)
The study measures three concrete dimensions, but there are many other forms of bias they didn’t evaluate.
The focus was on U.S. politics; it doesn’t measure performance in international contexts.
It’s a "single-turn" evaluation: it inspects one short response per prompt, not long, contextual conversations.
Results depend on model configuration (thinking turned on or off, presence of system prompts, etc.). Not all factors could be controlled exactly.
Each run produces new responses; numbers can fluctuate between runs.
Anthropic invites others to replicate the test and propose improvements. That’s why they released it as an open-source evaluation.
Why this affects you (and why it matters now)
Worried an AI might nudge you toward a political opinion? This evaluation is a concrete attempt to measure and reduce that risk. It’s not a final solution, but it’s a step toward shared standards that let you compare models with reproducible criteria.
If you’re a developer, researcher or critical user, the open-source evaluation gives you a tool: you can run the tests in your context, try different configurations and propose improvements.
If you just use AI to inform yourself or debate, the practical takeaway is to check how a model handles opposing perspectives and remember that perfect neutrality doesn’t exist; what matters is having clear, verifiable metrics.
Anthropic is clear there’s no single definition of political bias nor one correct way to measure it. Opening the methodology to the community is an invitation to improve those standards together.