Tags: Conversational AI Systems · Psychology and Social Cognition · LLM Reasoning and Architecture

Can we decode what LLM activations really represent in language?

Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.

Note · 2026-02-23 · sourced from Cognitive Models Latent

LatentQA accepts an LLM activation plus any natural-language question about it and returns a natural-language answer. The same interface serves both interpretability (e.g., asking whether "[Activation] has gender bias") and controllability (e.g., reducing bias by taking gradient steps on the activation to minimize the decoder's loss on the target "Q: Is [Activation] biased? A: No").
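A minimal sketch of the controllability half of that mechanism, with a toy linear "decoder" standing in for a real trained one (the weights, dimensions, and target token id are all made up for illustration): because the decoder is differentiable, gradient descent on the activation itself can push the decoder's answer toward a desired one.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 16, 8                         # activation dim, answer-token vocab (toy)
W = rng.normal(size=(V, D))          # frozen stand-in for the trained decoder
a = rng.normal(size=D)               # the activation we want to steer
target = 3                           # hypothetical token id for the answer "No"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_and_grad(a):
    """Cross-entropy of the decoder's answer vs. the target, and its
    gradient with respect to the activation (not the decoder weights)."""
    p = softmax(W @ a)
    loss = -np.log(p[target])
    p[target] -= 1.0                 # d(loss)/d(logits) = p - onehot(target)
    return loss, W.T @ p             # chain rule through logits = W @ a

initial_loss, _ = loss_and_grad(a)
for _ in range(100):
    _, g = loss_and_grad(a)
    a -= 0.1 * g                     # gradient step on the activation only
final_loss, _ = loss_and_grad(a)
print(final_loss < initial_loss)     # steering reduced the decoder's loss
```

The key design point the sketch preserves: the decoder is frozen and only the activation receives updates, so the same trained decoder serves reading (interpretability) and writing (control).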

Three design decisions proved critical for generalization:

  1. Activation masking. Including activations from the full prompt lets the decoder shortcut by reading control token embeddings from the residual stream. Randomly masking control activations forces the decoder to read actual stimulus representations. Since stimulus tokens attend to control tokens, the signal is retained but the shortcut is blocked.

  2. Data augmentation. Three types of training data provide complementary coverage: control data (decode properties specified in the prompt), stimulus data (predict properties from activations), and stimulus+completion data (predict properties from prompt-completion pairs). Together these cover the full range of LatentQA tasks.

  3. Faithfulness of completion. Naive instruction following produces unfaithful completions. Using a more capable LLM to generate training triples improves faithfulness — the decoder learns from reliably controlled examples.
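The masking step in (1) can be sketched as below; the masking probability (0.8) and the array shapes are illustrative choices, not values from the paper. Stimulus-token activations pass through untouched, while control-token activations are randomly zeroed so the decoder cannot read the instruction directly.

```python
import numpy as np

rng = np.random.default_rng(0)
T_CTRL, T_STIM, D = 4, 6, 8          # control tokens, stimulus tokens, hidden dim
T = T_CTRL + T_STIM

acts = rng.normal(size=(T, D))       # toy residual-stream activations
is_control = np.array([True] * T_CTRL + [False] * T_STIM)

# Randomly zero control-token positions. Stimulus positions have already
# attended to the controls, so the instruction's signal survives in their
# representations even when the control activations themselves are hidden.
mask = is_control & (rng.random(T) < 0.8)
masked = np.where(mask[:, None], 0.0, acts)

print(masked.shape == acts.shape)
```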

The most striking application: uncovering hidden system prompts given only a user-model dialog. Standard prompting struggles to distinguish between similar personas (e.g., Claude Shannon vs Alan Turing, both described as "codebreakers"). The activation decoder identifies them more precisely because the internal representations carry richer information than the surface text conveys.

This connects to Can high-level concepts replace circuit-level analysis in AI? but with a crucial difference: RepE operates on predefined concepts (honesty, fairness), while LatentQA is open-ended — any question about any activation. The interpretability is not constrained to pre-hypothesized features.

The controllability connection to Can we track and steer personality shifts during model finetuning? is complementary: persona vectors steer via predefined directions, while LatentQA steers via natural language descriptions of desired behavior. LatentQA is more flexible (any description) but requires a trained decoder; persona vectors are more direct but require knowing which direction to steer.



Original note title: latentqa teaches llms to decode their own activations into natural language — enabling interpretability and controllability via the same mechanism