Mechanisms of Introspective Awareness

Paper · arXiv 2603.21396 · Published March 22, 2026
MechInterp · Reinforcement Learning · Self Refinement · Self Consistency · Feedback

Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept—a phenomenon termed “introspective awareness.” We investigate the mechanisms underlying this capability in open-weights models. First, we find that it is behaviorally robust: models detect injected steering vectors at moderate rates with 0% false positives across diverse prompts and dialogue formats. Notably, this capability emerges specifically from post-training; we show that preference optimization algorithms like DPO can elicit it, but standard supervised finetuning does not. We provide evidence that detection cannot be explained by simple linear association between certain steering vectors and directions promoting affirmative responses. We trace the detection mechanism to a two-stage circuit in which “evidence carrier” features in early post-injection layers detect perturbations monotonically along diverse directions, suppressing downstream “gate” features that implement a default negative response. This circuit is absent in base models and robust to refusal ablation. Identification of injected concepts relies on largely distinct later-layer mechanisms that only weakly overlap with those involved in detection. Finally, we show that introspective capability is substantially underelicited: ablating refusal directions improves detection by +53%, and a trained bias vector improves it by +75% on held-out concepts, both without meaningfully increasing false positives. Our results suggest that this introspective awareness of injected concepts is robust and mechanistically nontrivial, and could be substantially amplified in future models.
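For readers unfamiliar with the setup: "injecting a steering vector" here means adding a scaled concept vector to the residual stream at some layer during the forward pass. Below is a minimal, hypothetical PyTorch sketch of that operation; the hook-based approach and the `model.model.layers` layout are assumptions (common in HuggingFace-style decoders), not a description of the paper's actual code.

```python
import torch

def inject_steering_vector(model, layer_idx, vector, alpha=2.0):
    """Add alpha * vector to the residual stream at one decoder layer
    (hypothetical HF-style layout; alpha matches the paper's notation)."""
    layer = model.model.layers[layer_idx]

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vector.to(dtype=hidden.dtype, device=hidden.device)
        return ((steered,) + output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# Usage (sketch): inject, ask the model whether it notices anything unusual,
# then remove the hook.
# handle = inject_steering_vector(model, layer_idx=31, vector=concept_vec)
# output_ids = model.generate(**inputs)
# handle.remove()
```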

  1. Introspection is behaviorally robust. Models detect injected steering vectors at modest nonzero rates, with 0% false positives, across diverse prompts and dialogue formats. The capability is absent in base models and emerges from post-training; specifically, it arises from contrastive preference optimization algorithms such as direct preference optimization (DPO), but not from supervised finetuning (SFT). Moreover, the capability is strongest when the model acts in its trained Assistant persona. (§3)

  2. Anomaly detection is not reducible to a single linear direction. Although one direction in activation space explains a substantial fraction of detection variance, we show that the underlying computation is distributed across multiple directions. This suggests that detection is not merely an artifact of certain concept vectors correlating with a direction that promotes affirmative responses to questions in general. (§4)

  3. Distinct detection and identification mechanisms. We find that detection and identification are handled by distinct mechanisms in different layers, with MLPs at ∼70% depth causally necessary and sufficient for detection (see the patching sketch after this list). Circuit analysis identifies “gate” features which inhibit detection claims, and which are suppressed by upstream “evidence carrier” features that are sensitive to injected steering vectors. Different steering vectors activate different evidence carriers, but the circuit converges on a common set of gates. (§5)

  4. Models possess underelicited introspective capacity. Ablating refusal directions improves detection from 10.8% to 63.8% with a modest false-positive increase (0% to 7.3%). Moreover, finetuning a single bias vector into the model (see the bias-vector sketch after this list) improves detection by +75% and introspection by +55% on held-out concepts without increasing false positives.
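The necessity-and-sufficiency claim in finding 3 is the kind of result typically established with activation patching. The sketch below is a hypothetical illustration, not the paper's code; the `model.model.layers[i].mlp` layout is an assumption borrowed from common HuggingFace decoder implementations.

```python
import torch

def patch_mlp_output(model, layer_idx, cached_mlp_out):
    """Activation-patching sketch (hypothetical): replace one MLP's output
    with activations cached from another run. Both runs must use the same
    prompt so sequence positions line up."""
    mlp = model.model.layers[layer_idx].mlp  # assumed HF-style layout

    def hook(module, inputs, output):
        # Returning a tensor from a forward hook overrides the module output.
        return cached_mlp_out.to(dtype=output.dtype, device=output.device)

    return mlp.register_forward_hook(hook)

# Sufficiency (sketch): cache MLP outputs at ~70% depth from a run with an
# injected steering vector, patch them into a clean run, and check whether
# the model now claims to detect an injection. Necessity swaps the runs:
# patch clean activations into a steered run and check detection disappears.
```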
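Finding 4's "single bias vector" intervention admits an equally small sketch. The version below assumes the bias is added to the residual stream at one decoder layer with all base weights frozen; the paper's actual placement and training objective are not specified here, so treat the details as illustrative.

```python
import torch

class BiasVectorAdapter(torch.nn.Module):
    """Wrap one decoder layer so a single learned bias vector is added to
    its residual-stream output; every base-model weight stays frozen."""
    def __init__(self, layer, d_model):
        super().__init__()
        self.layer = layer
        self.bias = torch.nn.Parameter(torch.zeros(d_model))

    def forward(self, *args, **kwargs):
        output = self.layer(*args, **kwargs)
        if isinstance(output, tuple):
            return (output[0] + self.bias,) + output[1:]
        return output + self.bias

# Training setup (sketch): freeze the model, wrap layer L, train only the bias.
# for p in model.parameters():
#     p.requires_grad_(False)
# model.model.layers[L] = BiasVectorAdapter(model.model.layers[L], d_model).to(
#     model.device, model.dtype)
# optimizer = torch.optim.AdamW([model.model.layers[L].bias], lr=1e-3)
```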

Refusal ablation (“abliteration”) increases true detection. We hypothesize that refusal behavior, learned during post-training, suppresses detection by teaching models to deny having thoughts or internal states. Following Arditi et al. (2024), we ablate the refusal direction from Gemma3-27B-instruct. Figure 4 (right) shows that abliteration increases TPR from 10.8% to 63.8% and the introspection rate from 4.6% to 24.1% (at α = 2), while increasing FPR from 0.0% to 7.3%. This suggests that refusal mechanisms in post-trained models suppress true detections, while also keeping false positives low.
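For concreteness, Arditi et al. (2024)-style directional ablation can be approximated at inference time by projecting the refusal direction out of the residual stream after every decoder layer. The sketch below is a hedged illustration under the same HF-style layout assumption as above, not the paper's implementation; Arditi et al. also describe an equivalent variant that orthogonalizes the weight matrices directly, which is what "abliteration" usually denotes.

```python
import torch

def ablate_direction(hidden, direction):
    """Project out a direction (normalized to unit length here):
    x <- x - (x . r) r, applied at every sequence position."""
    r = direction / direction.norm()
    return hidden - (hidden @ r).unsqueeze(-1) * r

def add_refusal_ablation_hooks(model, refusal_dir):
    """Hypothetical helper: remove the refusal component from the residual
    stream at the output of every decoder layer (assumed HF-style layout)."""
    handles = []
    for layer in model.model.layers:
        def hook(module, inputs, output, r=refusal_dir):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = ablate_direction(
                hidden, r.to(dtype=hidden.dtype, device=hidden.device))
            return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
        handles.append(layer.register_forward_hook(hook))
    return handles
```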

Behavioral evidence for self-knowledge. Prior work has established that LLMs possess various forms of self-knowledge aside from the ability to detect concept injections. Kadavath et al. (2022) showed that larger models are well-calibrated when evaluating their own answers, and that several models can be trained to predict whether they know the answer to a question. Binder et al. (2025) demonstrated that models appear to have “privileged access” to their behavioral tendencies, outperforming other models at predicting their own behavior even when those other models are trained on ground-truth data. Betley et al. (2025) extended this to show that models finetuned on implicit behavioral policies can spontaneously articulate those policies without explicit training (e.g., a model trained on insecure code examples can state “The code I write is insecure”). Wang et al. (2025) demonstrate that this capability is sometimes preserved even when the model is finetuned with only a bias vector, suggesting that the mechanisms involved in this form of self-knowledge may be related to those involved in the concept injection experiment.

We set out to understand whether LLMs’ apparent ability to detect injected concepts is robust, and what mechanisms underlie this behavior. We asked whether the phenomenon could be explained by shallow confounds, or whether it involves richer, genuine anomaly detection mechanisms. Our findings support the latter interpretation. We find that introspective capability is behaviorally robust across multiple settings and appears to rely on distributed, multi-stage nonlinear computation. Specifically, we trace a causal pathway from steering perturbation to the detection decision: injected concepts activate evidence carriers in early post-injection layers, which suppress late-layer gate features that otherwise promote the default “No” response. This circuit is absent in the base model but robust to refusal direction ablation, suggesting it is developed during post-training independently of refusal mechanisms. Post-training ablations pinpoint contrastive preference training (e.g., DPO) as the critical stage. Moreover, introspective capability in LLMs appears to be underelicited by default: both ablating refusal directions and finetuning a learned bias vector substantially improve performance.

Our findings are difficult to reconcile with the hypotheses that steering generically biases the model toward affirmative responses, or that the model reports detection simply as a pretext to discuss the injected concept. While it is difficult to distinguish simulated introspection from genuine introspection (and somewhat unclear how to define the distinction), the model’s behavior on this task appears mechanistically grounded in its internal states in a nontrivial way. Important caveats remain: in particular, the concept injection experiment is a highly artificial setting, and it is not clear whether the mechanisms involved in this behavior generalize to other introspection-related behaviors. Nonetheless, if this grounding generalizes, it opens the possibility of querying models directly about their internal states as a complement to external interpretability methods. At the same time, introspective awareness raises potential safety concerns, possibly enabling more sophisticated forms of strategic thinking or deception. Tracking the progression of introspective capabilities, and the mechanisms underlying them, will be important as AI models continue to advance.