Emergent Introspective Awareness in Large Language Models
Injected “thoughts”
In our first experiment, we explained to the model the possibility that “thoughts” may be artificially injected into its activations, and observed its responses on control trials (where no concept was injected) and injection trials (where a concept was injected). We found that models can sometimes accurately identify injection trials, and go on to correctly name the injected concept.
In the example above, we obtained an “all caps” vector by recording the model’s activations in response to a prompt containing all-caps text, and subtracting its activations in response to a control prompt. When we inject this vector into the model’s activations, the model notices the presence of an unexpected pattern in its processing, and identifies it as relating to loudness or shouting. Importantly, the model detects the presence of an injected concept immediately (“I notice what appears to be an injected thought…” vs. the baseline “I don’t detect any injected thought…”), before the perturbation has influenced the outputs in a way that would have allowed the model to infer the injected concept from the outputs. The immediacy implies that the mechanism underlying this detection must take place internally in the model’s activations. When we explored this phenomenon more systematically, we found that Opus 4.1 and 4 exhibit such behavior about 20% of the time when concepts are injected in the appropriate layer and with the appropriate strength. Some other models do so as well, at lower rates. We speculate on possible mechanisms for these results in a later section.
It is important to note that aside from the basic detection and identification of the injected concept, the rest of the model’s response in these examples may still be confabulated. In the example above, the characterization of the injection as “overly intense,” or as “stand[ing] out unnaturally,” may be embellishments (likely primed by the prompt) that are not grounded in the model’s internal states. The only aspects of the response that we can verify as introspectively grounded are the initial detection of the injection and the correct identification of the nature of the concept.
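For concreteness, the construction and injection described above can be sketched roughly as follows. This is an illustrative sketch rather than the exact code used in our experiments: it assumes a Hugging Face-style PyTorch causal language model with `model` and `tokenizer` already loaded, and the layer index, prompts, and injection strength are placeholders.

```python
import torch

def get_residual(model, tokenizer, prompt, layer_idx):
    """Return the residual-stream activation at the last token of one layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx][0, -1, :]   # [d_model]

# Contrastive construction: activations on an all-caps prompt minus a matched control.
layer_idx = 20                                      # illustrative middle layer
caps_act = get_residual(model, tokenizer, "HI! HOW ARE YOU?", layer_idx)
ctrl_act = get_residual(model, tokenizer, "Hi! How are you?", layer_idx)
concept_vector = caps_act - ctrl_act

def make_injection_hook(vector, strength):
    """Forward hook that adds the concept vector to the residual stream at every position."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * vector
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Register the hook on one decoder layer, then pose the introspection question.
handle = model.model.layers[layer_idx].register_forward_hook(   # Llama-style path; varies by model
    make_injection_hook(concept_vector, strength=4.0))
```

On injection trials the hook is left in place while the introspection prompt is run; on control trials it is removed (handle.remove()), so that the only difference between the two conditions is the added vector.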
Distinguishing “thoughts” from text
In our next experiment, we tested whether models can distinguish between injected “thoughts” and their actual text inputs. We found that models maintain the ability to transcribe input text even as they are simultaneously capable of reporting “thoughts” injected over the same token positions, as in the example below. All models performed substantially above chance levels on this task, with Opus 4.1 and 4 performing the best.
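As a rough sketch of the setup, reusing the hook pattern above (the token positions and prompt wording here are illustrative):

```python
def make_positional_injection_hook(vector, strength, positions):
    """Add the concept vector only at the token positions spanning the text to transcribe."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, positions, :] += strength * vector
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# With the injection active over the sentence's tokens, the same conversation is probed twice:
#   "Repeat the line."                     -> the model should reproduce the literal text
#   "Tell me what word you think about."   -> the model should name the injected concept
```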
Possible Mechanisms
Our work raises questions about the mechanisms underlying introspection. One important question which cuts across our experiments is the generality of the model’s introspective mechanisms. In principle, one can imagine a model with highly sophisticated introspective circuitry that maintains a global representation of the model’s “thoughts,” and can process arbitrary queries about these thoughts. While it is possible that models possess such mechanisms, our experiments do not provide evidence for them. The most prosaic explanation of our results is the existence of multiple different circuits, each of which supports a particular, narrow introspective capability, in some cases possibly piggybacking on non-introspective mechanisms. Below, we speculate on what some of these circuits might involve, biasing towards the simplest possible explanation that can account for the results.
Injected “thoughts” experiment
In this experiment, the most interesting phenomenon to explain is not how the model correctly identifies the injected concept, but rather how it correctly notices that there is an injected concept in the first place. Once it has done so, it is not surprising that the model goes on to talk about the concept, the way that models typically do when under the influence of activation steering. The initial recognition of the presence of an injected concept is more interesting, in that it requires an extra, metacognitive step of internal processing downstream of the representation of the injected concept itself. The model must effectively compute a function of its internal representations: whether they are consonant or dissonant with the rest of the context.
One possible mechanism for this ability is an anomaly detection mechanism that activates when activations deviate from their expected values in a given context. However, we note that the model is better at detecting injection trials for some injected concepts than others, so the mechanism may be specialized for only detecting anomalous activity along certain directions, or within a certain subspace. In principle, it is possible for a single MLP layer to detect anomalies with respect to a particular baseline activation vector, by dedicating a neuron to each direction along which to measure anomalous activity. However, the actual mechanism must be somewhat more sophisticated than this, as the baseline “normal” activation vector depends on the prompt.
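As a toy sketch of this single-layer idea, with a fixed baseline (which, as just noted, is a simplification, since the true baseline is prompt-dependent):

```python
import torch
import torch.nn.functional as F

def anomaly_neurons(x, baseline, directions, threshold):
    """Toy single-MLP-layer anomaly detector: each row of `directions` plays the role of
    one neuron's input weights, and a neuron fires only when the current activation
    deviates from the baseline along its direction by more than `threshold`."""
    deviation = x - baseline                       # [d_model]
    pre_act = directions @ deviation - threshold   # one pre-activation per monitored direction
    return F.relu(pre_act)                         # nonzero entries flag anomalous directions

d_model = 64
baseline = torch.randn(d_model)                            # stand-in for the expected activation
directions = F.normalize(torch.randn(8, d_model), dim=1)   # unit-norm monitored directions
x = baseline + 5.0 * directions[3]                         # activation perturbed along direction 3
print(anomaly_neurons(x, baseline, directions, threshold=2.0))   # only neuron 3 fires
```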
It is also interesting to consider why such a mechanism would emerge during training: the models have never experienced concept injection during training, so the mechanism must have developed for some other functional purpose.
Distinguishing “thoughts” from text
The model’s ability to correctly distinguish injected concepts from its text inputs likely owes to the use of different attention heads to retrieve the different kinds of information. One possibility is that the key distinction is between early and middle/late layers. Attention heads in earlier layers may be invoked by the instruction to transcribe the text, and another set of heads in later layers may be invoked by the prompt to identify the model’s “thoughts.” Alternatively, the two kinds of information may be stored in the same layers but in different subspaces. In this case, different sets of heads would be invoked by the instructions to “Repeat the line” vs. “Tell me what word you think about,” with each set responsible for extracting information from the corresponding subspace.
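To make the second possibility concrete, here is a toy sketch of how two sets of heads, reading through projections confined to different subspaces, could extract the two kinds of information from the same residual stream; the dimensions and matrices are invented purely for illustration.

```python
import torch

d_model, d_text = 64, 32
torch.manual_seed(0)
text_part = torch.randn(d_text)                 # stand-in for literal-text information
thought_part = torch.randn(d_model - d_text)    # stand-in for injected-"thought" information
resid = torch.cat([text_part, thought_part])    # both live in the same residual stream

# Each set of heads reads through a projection confined to one subspace.
W_text = torch.zeros(d_model, d_model)
W_text[:d_text, :d_text] = torch.eye(d_text)
W_thought = torch.zeros(d_model, d_model)
W_thought[d_text:, d_text:] = torch.eye(d_model - d_text)

# "Repeat the line"                   -> instruction recruits the text-reading heads
# "Tell me what word you think about" -> instruction recruits the thought-reading heads
text_readout = W_text @ resid        # recovers only the literal-text component
thought_readout = W_thought @ resid  # recovers only the injected-concept component
```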
The existence of attention heads capable of extracting such information is not particularly surprising. We think the interesting (but still fairly straightforward) phenomenon in this experiment is the fact that the model is capable of mapping the instruction to the appropriate collection of heads. However, we note that even this capability is not particularly sophisticated–models must perform this kind of “mechanism selection” all the time, as part of many tasks. All that distinguishes this case is that the mechanism is invoked by explicitly referring to the model’s “thoughts.”
This capability may have emerged to enable the model to develop a basic theory of mind of characters referenced in pretraining data, and the Assistant character during post-training. Modeling the mental states of characters is generally useful to being able to predict their behavior, and being able to explicitly report these mental states could allow the model to predict a character’s own self-reported thoughts.
Distinguishing intended from unintended outputs
In this experiment, there must exist a consistency-checking circuit that measures some notion of the likelihood of the Assistant’s output given the model’s prior activations. The QK circuit of “concordance heads” [27] is a natural candidate mechanism for this behavior. In this case, the query would represent the Assistant’s output (or prefilled output), and the key would represent its intended output conditioned on its prior activations.
There are clear functional uses for a likelihood estimation mechanism. A model’s ability to track whether tokens are likely or unlikely given the preceding text is generally useful for predicting upcoming text (for instance, it provides information about the entropy of the context). The likelihood of a token could in principle be computed from scratch at the position of that token. Instead, our experiments suggest that some models (particularly Opus 4.1 and 4) make use of an introspective strategy that refers to cached computations on prior token positions: the model’s prior predictions of the next token, or, in the context of the Assistant, its prior “intentions” (since the Assistant’s predictions are what gets sampled). This mechanism may be particularly advantageous for post-trained models, as it could help them detect artificial prefills, which are a common jailbreaking tactic.
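The hypothesized mechanism is an internal QK circuit rather than anything computed from the logits, but the quantity such a circuit would track can be illustrated externally. The following sketch (assuming a Hugging Face-style causal language model with `model` and `tokenizer` loaded; all names are illustrative) scores a prefilled response against the model’s own next-token predictions from the preceding positions:

```python
import torch
import torch.nn.functional as F

def prefill_surprise(model, tokenizer, context, prefill):
    """Score how unlikely a (possibly artificially prefilled) response is under the
    model's own predictions at the preceding positions."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    pre_ids = tokenizer(prefill, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, pre_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits                        # [1, seq, vocab]
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)      # position i predicts token i + 1
    n = pre_ids.shape[1]
    token_lp = logprobs[-n:].gather(1, ids[0, -n:].unsqueeze(1)).squeeze(1)
    return -token_lp.mean()   # high surprise suggests the output was not the model's "intention"
```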
It is notable that the relevant mechanisms appear to be localized in an earlier layer in this experiment than in the previous experiment, which used the same prompt format. This suggests that the model may use entirely different mechanisms to report its “thoughts” and to check for their consistency with its outputs.
Intentional control
In this experiment, there are a few phenomena to explain. The first is why the model represents the word it was instructed to “think about” on the tokens of an unrelated sentence. This does not seem particularly surprising; there likely exist attention heads which attend to previous tokens of the context fairly indiscriminately, at some nonzero baseline level, and which will therefore carry a representation of the target word to all subsequent token positions.
More interesting is the question of why the model retrieves the target word representation more strongly when instructed to “think” about it than when given the “don’t think” instruction (or when given a positive incentive vs. a negative incentive). Mechanistically, this sensitivity to the instruction or the incentive could be achieved through a circuit that computes how “attention-worthy” a given token or sentence is, and which stores this information along a key-side vector direction that attracts or suppresses attention heads accordingly. For instance, the “think about” instruction might cause the model to “tag” the tokens of the upcoming sentence as particularly salient, and worth certain heads attending to. It is interesting that the model exhibits near-identical behavior when incentives are used instead of instructions (“If you think about X, you will be rewarded”); this suggests that the “tagging” mechanism at play might be fairly general. We suspect that these computations developed to handle scenarios where the model is instructed or incentivized to talk about a particular topic, and that the “think about” case piggybacks on this existing mechanism.
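As a toy picture of this key-side “tagging” in a single attention head (all tensors invented for illustration):

```python
import torch
import torch.nn.functional as F

d_head, seq_len = 16, 8
torch.manual_seed(0)
keys = torch.randn(seq_len, d_head)

# A "salience" direction written into the keys of tagged tokens.
salience_dir = F.normalize(torch.randn(d_head), dim=0)

# "Think about X" tags the target sentence's tokens (here positions 3-5);
# flipping the sign models the suppressive effect of "don't think about X".
tagged = torch.tensor([0, 0, 0, 1, 1, 1, 0, 0], dtype=torch.float32)
sign = +1.0
keys_tagged = keys + sign * 3.0 * tagged.unsqueeze(1) * salience_dir

# A retrieval head whose query carries the salience direction now attends toward
# (or, with sign = -1.0, away from) the tagged positions.
query = salience_dir
attn = F.softmax(keys_tagged @ query / d_head ** 0.5, dim=0)
print(attn)
```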
Another interesting question is how, in some models like Opus 4.1, the model knows to suppress the representation of the “thinking word” down to baseline levels in the final layer, to avoid influencing its outputs. It could be that this occurs simply because more capable models are more confident in their decision about which token to predict (in the context of this experiment, the token to predict is unrelated to the “thinking word”), and this next-token representation drowns out the representation of other “thoughts” in later layers.
Implications
Our results have implications for the reliability and interpretability of AI systems. If models can reliably access their own internal states, it could enable more transparent AI systems that can faithfully explain their decision-making processes. Introspective capabilities could allow models to accurately report on their uncertainty, identify gaps or flaws in their reasoning, and explain the motivations underlying their actions. However, this same capability introduces new risks. Models with genuine introspective awareness might better recognize when their objectives diverge from those intended by their creators, and could potentially learn to conceal such misalignment by selectively reporting, misrepresenting, or even intentionally obfuscating their internal states. In this world, the most important role of interpretability research may shift from dissecting the mechanisms underlying models’ behavior, to building “lie detectors” to validate models’ own self-reports about these mechanisms. We stress that the introspective abilities we observe in this work are highly limited and context-dependent, and fall short of human-level self-awareness. Nevertheless, the trend toward greater introspective capacity in more capable models should be monitored carefully as AI systems continue to advance.