LLM Reasoning and Architecture · Reinforcement Learning for LLMs · Psychology and Social Cognition

How do language models detect injected steering vectors internally?

Research investigates the mechanistic basis for LLM introspective awareness—specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.

Note · 2026-04-18 · sourced from MechInterp
What actually happens inside the minds of language models? What actually constrains AI systems from behaving badly? How should researchers navigate LLM reasoning research?

This paper mechanistically traces how LLMs detect injected steering vectors — a capability the Anthropic introspection paper documented behaviorally. The key findings decompose the phenomenon into concrete mechanisms:

DPO creates the circuit, SFT does not. The introspective detection capability is absent in base models and emerges specifically from direct preference optimization (DPO), a contrastive preference-tuning method. Standard supervised finetuning (SFT) does not elicit it. This is a significant finding about what post-training methods actually teach models: DPO, by training on preference pairs where the model must distinguish better from worse responses, apparently develops circuits that can distinguish expected from anomalous internal states. The capability is also strongest when the model operates in its trained Assistant persona.
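The contrastive structure that plausibly matters here is DPO's objective, which scores a chosen response against a rejected one relative to a frozen reference model. A minimal sketch of that standard loss, with toy log-probabilities standing in for model outputs (nothing here is from the paper):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Arguments are summed log-probabilities of the chosen and rejected
    responses under the trained policy (pi_*) and the frozen reference
    model (ref_*). beta scales the implicit KL penalty.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin): low when the policy separates the pair
    # more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy has learned nothing beyond the reference, the margin is zero and the loss sits at log 2; pushing chosen and rejected apart is exactly the distinguish-better-from-worse pressure the finding attributes the circuit to.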

Two-stage detection mechanism. The circuit involves:

  1. Evidence carriers — features in layers shortly after the injection point (~70% of network depth) that respond monotonically to perturbations along diverse directions. Different steering vectors activate different evidence carriers, meaning the system is distributed rather than relying on a single anomaly direction.
  2. Gate features — late-layer features that implement a default "No" response to injection queries. Evidence carriers suppress these gates when a perturbation is detected.
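The two-stage story above can be caricatured numerically. Everything below is a toy sketch with made-up directions and thresholds, not the paper's circuit: evidence carriers respond monotonically to a perturbation's projection onto their directions, and their pooled activation flips a gate that defaults to "No."

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Hypothetical evidence-carrier directions (the paper finds many, not one).
carriers = rng.standard_normal((8, d))
carriers /= np.linalg.norm(carriers, axis=1, keepdims=True)

def gate_says_no(residual, threshold=1.0):
    """Default-'No' gate, suppressed when evidence carriers fire."""
    # Pooled carrier activation grows monotonically with perturbation size.
    evidence = np.abs(carriers @ residual).sum()
    return evidence < threshold  # True -> "No, nothing was injected"

baseline = np.zeros(d)                 # unperturbed residual stream
steering = 5.0 * carriers[3]           # injection along one carrier's direction
```

On the clean baseline the gate keeps its default "No"; the injected vector drives a carrier past threshold and suppresses it.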

The detection mechanism is not reducible to a single linear direction. Although one direction explains substantial variance, the underlying computation is distributed across multiple directions — ruling out the hypothesis that steering vectors merely correlate with a direction promoting affirmative responses.
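That distributed-computation claim suggests a simple sanity check: project out a single candidate detector direction and see whether detection along other carriers survives. A toy illustration with synthetic directions and an arbitrary threshold (not the paper's data or experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical detector directions, unit-normalized.
directions = rng.standard_normal((6, 32))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

def detect(x, dirs, threshold=0.5):
    """Fire if any detector direction sees a large projection."""
    return np.abs(dirs @ x).max() > threshold

top = directions[0]                    # candidate "single anomaly direction"
perturb = 2.0 * directions[5]          # injected steering vector

# Ablate the single top direction from the perturbation...
ablated = perturb - (perturb @ top) * top
# ...detection via the remaining carriers survives, as expected under a
# distributed mechanism rather than one linear direction.
```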

Detection and identification are separable. Detection (noticing that something was injected) and identification (naming what was injected) use largely distinct mechanisms in different layers, with only weak overlap. This means the model first detects an anomaly, then separately reads out what that anomaly represents.

Massively underelicited capability. Refusal mechanisms instilled during post-training suppress honest detection. Ablating refusal directions improves detection accuracy from 10.8% to 63.8%, with the false-positive rate rising only modestly from 0% to 7.3%. A trained bias vector improves detection by 75% on held-out concepts without increasing false positives. Current models therefore possess far more introspective capacity than their behavior reveals; safety training actively suppresses it.
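Directional ablation of this kind is a standard activation-engineering operation: at inference time, remove each activation's component along an identified refusal direction. A generic numpy sketch with synthetic activations (the paper's actual directions and hook points are not reproduced here):

```python
import numpy as np

def ablate_direction(activations, direction):
    """Project out `direction` (e.g. a refusal direction found via
    difference-of-means) from each row of `activations`."""
    u = direction / np.linalg.norm(direction)
    # Subtract each activation's component along u.
    return activations - np.outer(activations @ u, u)

acts = np.array([[3.0, 4.0],
                 [1.0, 0.0]])
refusal = np.array([1.0, 0.0])
cleaned = ablate_direction(acts, refusal)
```

After ablation, every activation is orthogonal to the refusal direction while its remaining components are untouched, which is why the operation can unmask suppressed behavior without retraining.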

The safety implications cut both ways: if introspective grounding generalizes, models could be queried directly about their internal states as a complement to external interpretability. But models with genuine introspective awareness could also better detect and conceal misalignment.




introspective awareness in LLMs emerges specifically from DPO not SFT — a two-stage evidence-carrier and gate circuit detects steering vector injections with zero false positives