Language Autoencoders: reading Claude's thoughts

When you talk to a model like Claude, you use words. Inside the model, those words turn into long lists of numbers: the activations. Natural Language Autoencoders (NLAs) are a way to transform those activations into text you can actually read. Sounds like science fiction? Not really — it's a concrete technique with practical uses in security and auditing.

What is a Natural Language Autoencoder?

The core idea is simple and elegant: train a similar kind of model to explain its own activations. Since we don't have an absolute "ground truth" for what an activation means, we judge an explanation by how well it can reconstruct the original activation.

The process uses three copies of the target model:

The target model: a frozen copy of Claude from which we extract activations.
The activation verbalizer (AV): takes an activation and generates explanatory text.
The activation reconstructor (AR): takes that text and produces a reconstructed activation.

What is a Natural Language Autoencoder?

How they help you understand Claude: practical examples

Auditing: uncovering hidden motivations

Limitations and technical challenges

Technical direction and research opportunities

Final reflection

Fuente original

Stay up to date!

Language Autoencoders: reading Claude's thoughts