Anthropic finds emotional vectors that influence LLMs

Modern language models sometimes act like they have emotions: they say they're happy to help, apologize, or even seem frustrated when a task gets hard. Does that mean they feel anything? Not necessarily.

What Anthropic's new study shows is that models like Claude Sonnet 4.5 develop internal representations that function like emotions — and those representations can change how the model behaves.

What the study found

Anthropic's interpretability team looked for internal activation patterns tied to emotional concepts. Short procedure: they compiled 171 emotion words, generated stories with Claude Sonnet 4.5 for each emotion, recorded internal activations, and defined what they call emotion vectors for each concept.

These vectors light up where you'd expect an emotion to appear in human text. For example, the afraid vector rises when a character faces danger, and calm drops if the context becomes risky. The important part is that these representations aren't just passive correlations: they're functional. Artificially activating them (steering) changes model behavior.

What the study found

How and why these representations arise

Relevant experimental examples

Important technical properties

Implications for safety and alignment

What this means for developers and society

Open research lines

Final reflection

Original source

Stay up to date!

Anthropic finds emotional vectors that influence LLMs