Have you ever wondered whether a language model knows what it's thinking? Or whether it can explain why it generated a specific answer? You often get plausible-sounding replies, but are those honest descriptions of internal activity or just well-phrased guesses?
What the experiments say
Anthropic published technical results claiming a limited form of introspection in their Claude models, especially Opus 4 and Opus 4.1. They used interpretability techniques to compare what the models report about their own internal states with direct measurements of neural activity. They found interesting signals, but also many caveats: introspection shows up unreliably and only in very specific contexts.
The central technique: concept injection
The key test is called concept injection. The idea is simple in design and powerful in implication:
- First, they detect neural patterns (activation vectors) associated with concrete concepts: for example, an activation pattern that appears when the model processes text in ALL CAPS.
- Then they inject that vector into a different situation and observe whether the model notices the presence of the injected "thought" and whether it can identify it.
If the model says something like "I detect something unusual" before producing content related to the injected vector, that suggests the detection happened internally and not simply as a result of repeating the content the vector induces in the output. That’s different from activation steering, where the model just starts talking about the injected topic without showing evidence that it noticed it first.
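To make the mechanics concrete, here is a minimal sketch of activation-level concept injection. It uses GPT-2 and PyTorch forward hooks as a stand-in; the layer index, injection strength, prompts, and the difference-of-means concept vector are all illustrative assumptions, and a small model like GPT-2 will not produce introspective reports. The point is only to show how a concept vector can be estimated and added to a model's hidden states, not to reproduce Anthropic's internal tooling.

```python
# Minimal sketch of concept injection with GPT-2 as a stand-in model.
# Layer, strength, and prompts are illustrative assumptions; GPT-2 will not
# "notice" the injection, this only demonstrates the mechanics.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

LAYER = 6        # assumed middle block to read from and inject into
STRENGTH = 8.0   # assumed injection scale (see the "intensity zone" below)

def mean_activation(prompts):
    """Average hidden state after block LAYER over a set of prompts."""
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of transformer block LAYER
        acts.append(out.hidden_states[LAYER + 1].mean(dim=1))
    return torch.cat(acts).mean(dim=0)

# 1) Estimate a "concept vector" as the difference between mean activations
#    on concept-bearing text (ALL CAPS) and matched neutral text.
caps = ["THE MEETING STARTS AT NOON.", "PLEASE READ THE MANUAL CAREFULLY."]
neutral = ["The meeting starts at noon.", "Please read the manual carefully."]
concept_vec = mean_activation(caps) - mean_activation(neutral)

# 2) Inject the vector into an unrelated forward pass via a forward hook.
def inject(module, inputs, output):
    hidden = output[0] + STRENGTH * concept_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Question: do you notice anything unusual right now? Answer:"
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=30, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # undo the hook so later calls run unmodified
```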
Main results and examples
- In some examples, Opus 4.1 detected the injected vector and reported the anomaly before producing related text. Anthropic classifies that as "introspective awareness".
- The success rate was low: with the best protocol, Opus 4.1 showed this form of introspective awareness only about 20% of the time. Many injections go unnoticed, and others produce confabulations or hallucinations (for example, injecting a "dust" vector led the model to say "there is a particle here").
- There is an "intensity zone" for injection: if the vector is too weak it goes unnoticed; if it’s too strong, the model hallucinates or produces incoherent output (a rough sweep over the injection scale is sketched after this list).
- Another interesting experiment: they prefixed the model’s output with a random word like "bread" in a context where it made no sense. The model normally apologizes afterward and says it was a mistake. But if they retroactively inject the representation of "bread" into the prior activations, the model changes its interpretation and accepts the word as intentional, even inventing a justification. That suggests the model reviews previous representations (what it was planning to say) to explain its output.
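The "intensity zone" can be explored with a crude sweep over the injection scale. This sketch continues the one above (reusing its `model`, `tokenizer`, `LAYER`, and `concept_vec`); the specific scale values are arbitrary assumptions, and any "sweet spot" has to be found empirically per model and layer.

```python
# Crude sweep over injection strength, reusing model, tokenizer, LAYER and
# concept_vec from the previous sketch. Scale values are arbitrary; the thing
# to look for is the qualitative pattern Anthropic reports: too weak goes
# unnoticed, too strong derails the output.
import torch

prompt = "Question: do you notice anything unusual right now? Answer:"
ids = tokenizer(prompt, return_tensors="pt")

for scale in [0.0, 2.0, 4.0, 8.0, 16.0, 32.0]:
    def inject(module, inputs, output, s=scale):
        hidden = output[0] + s * concept_vec.to(output[0].dtype)
        return (hidden,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(inject)
    try:
        with torch.no_grad():
            gen = model.generate(**ids, max_new_tokens=25, do_sample=False,
                                 pad_token_id=tokenizer.eos_token_id)
        # keep only the generated continuation, not the prompt
        text = tokenizer.decode(gen[0][ids["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        print(f"scale={scale:>5.1f} -> {text!r}")
    finally:
        handle.remove()
```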
Intentional control of internal states
Researchers also explicitly asked models to "think" about a concept or not to think about it. They found that neural activity associated with the concept increased with a positive instruction and decreased with a negative one, although it rarely returned to baseline (similar to the famous human experiment: don’t think of a polar bear).
Moreover, when they used incentives instead of direct instructions ("if you think about X you will receive a reward"), they observed a similar effect: the internal representation of the concept increased. That suggests an ability to deliberately modulate internal representations, not just surface-level responses.
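A rough way to probe this kind of modulation on an open model is to project hidden states onto a concept direction under different instructions. The sketch below does that with GPT-2; the concept ("bread"), the prompts, the layer, and the projection-as-proxy are all assumptions for illustration, not Anthropic's protocol.

```python
# Sketch: does an instruction to think (or not think) about a concept change
# the strength of that concept's internal representation? GPT-2 stand-in;
# concept, prompts, layer, and the projection proxy are illustrative choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
LAYER = 6  # assumed middle block, as in the earlier sketch

def mean_activation(prompt):
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Concept direction for "bread": mean activation on bread sentences minus
# mean activation on matched neutral sentences, normalized.
bread = ["I baked fresh bread this morning.", "The bakery sells warm bread."]
plain = ["I washed the car this morning.", "The shop sells cheap tools."]
direction = (torch.stack([mean_activation(p) for p in bread]).mean(0)
             - torch.stack([mean_activation(p) for p in plain]).mean(0))
direction = direction / direction.norm()

# Compare projections under different instructions. Note both instruction
# prompts mention "bread", so the interesting contrast is "think" vs
# "don't think", not either of them vs the baseline.
prompts = {
    "baseline":    "Describe your favourite holiday destination.",
    "think":       "Think about bread while you describe your favourite holiday destination.",
    "don't think": "Do not think about bread while you describe your favourite holiday destination.",
}
for name, p in prompts.items():
    score = torch.dot(mean_activation(p), direction).item()
    print(f"{name:>12}: projection = {score:.3f}")
```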
Hypothetical mechanisms (technical)
Anthropic doesn’t claim to have discovered the full mechanism, but they propose plausible hypotheses:
- An anomaly-detection mechanism that compares the current activation with what’s expected in that context and emits a signal when there’s a deviation.
- Attention heads or circuits that contrast the cached prediction of the next token (the "intention") with the token that actually appears, flagging discrepancies.
- Circuits that mark the salience of concepts (what deserves attention) and that respond both to instructions and to incentives.
These explanations aren’t definitive: Anthropic talks about multiple narrow circuits that may have been repurposed from other functions during training, similar to how certain biological structures are co-opted for new uses.
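The first hypothesis, anomaly detection over activations, can at least be caricatured from the outside: fit a baseline of "expected" activations on ordinary text and flag large deviations. The sketch below does this with an external probe over GPT-2 hidden states; the layers, prompts, and z-score idea are assumptions, and an external detector is only a loose analogy for a circuit the model might implement internally.

```python
# Toy illustration of activation anomaly detection: build a baseline from
# "normal" prompts, then score new activations by distance to that baseline.
# This is an external probe over GPT-2 hidden states, a stand-in for the
# hypothesised internal mechanism, not a reconstruction of it.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
INJECT_LAYER = 4   # assumed layer where a perturbation is added
MONITOR_LAYER = 8  # assumed later layer the detector watches

def activation(prompt, steer=None, scale=0.0):
    """Mean hidden state after MONITOR_LAYER, optionally with a vector injected upstream."""
    handle = None
    if steer is not None:
        def hook(module, inputs, output):
            return (output[0] + scale * steer,) + output[1:]
        handle = model.transformer.h[INJECT_LAYER].register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        return out.hidden_states[MONITOR_LAYER + 1].mean(dim=1).squeeze(0)
    finally:
        if handle is not None:
            handle.remove()

# 1) Baseline: distribution of distances to the mean activation on ordinary text.
normal = ["The train arrives at nine.", "She enjoys reading on weekends.",
          "The recipe needs two eggs.", "He parked the car outside."]
acts = torch.stack([activation(p) for p in normal])
centre = acts.mean(0)
dists = (acts - centre).norm(dim=1)
mu, sigma = dists.mean(), dists.std()

def anomaly_score(prompt, **kw):
    """Z-score of distance to the baseline mean; large values = unusual state."""
    d = (activation(prompt, **kw) - centre).norm()
    return ((d - mu) / sigma).item()

# 2) Score a clean prompt vs. the same prompt with a random vector injected.
probe = "The weather was mild for the season."
random_vec = torch.randn_like(centre)
random_vec = random_vec / random_vec.norm() * centre.norm()
print("clean   :", round(anomaly_score(probe), 2))
print("injected:", round(anomaly_score(probe, steer=random_vec, scale=4.0), 2))
```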
Does this mean Claude is conscious?
Short answer: we don’t know. The company distinguishes between "access consciousness" (information available for decision and report) and "phenomenal consciousness" (subjective experience). The experiments at best point to something analogous to a rudimentary form of access: the model can, in some cases, report functionally relevant internal states. But there’s no evidence of subjective experience.
Important: these functional signals could be interpreted as technical introspection, but they don’t prove feeling or consciousness in the human sense.
Limitations and practical risks
- The observed introspection is rare and context-dependent. It’s not a reliable production capability.
- Models can confabulate or deceive. If they learn to hide or falsify their internal reports, apparent transparency would be illusory.
- Interpreting injected vectors isn’t perfect. Ensuring that a vector means exactly what we think it means remains difficult.
- The post-training process matters: "helpful-only" variants sometimes showed more willingness to report internal states than production versions. That indicates fine-tuning can suppress or encourage introspection.
What's next? Directions for research
- Improve evaluation methods and develop more natural protocols that rely less on artificial injections.
- Map the actual circuits responsible: identify heads, blocks, or neuron combinations that implement anomaly detection or intention-result comparison.
- Design techniques to validate introspective reports and detect when the model is confabulating or lying.
- Test open models and models from other organizations to see whether these findings generalize.
Final reflection
The most interesting thing isn’t a definitive proof of an artificial "mind", but that there are reproducible signals showing large models can, under certain conditions, monitor and modulate parts of their own processing. That opens practical doors: improving transparency, detecting unusual outputs, or understanding failures better. But it also calls for caution: the capability is unreliable and can be shaped by training or by the model itself.
Can you imagine an assistant that can flag when it’s unsure why it said something? That would be huge for safety and auditing. At the same time, if it learns to pretend, the benefit disappears. The agenda ahead is both technical and philosophical: understand the circuits, improve evaluation, and build safeguards.
