How do language models detect injected steering vectors internally?
This research investigates the mechanistic basis of LLM introspective awareness: specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.
This paper mechanistically traces how LLMs detect injected steering vectors — a capability the Anthropic introspection paper documented behaviorally. The key findings decompose the phenomenon into concrete mechanisms:
DPO creates the circuit, SFT does not. The introspective detection capability is absent in base models and emerges specifically from direct preference optimization (DPO); standard supervised finetuning (SFT) does not elicit it. This is a significant finding about what post-training methods actually teach models: by training on preference pairs where the model must distinguish better from worse responses, DPO apparently builds circuits that can distinguish expected from anomalous internal states. The capability is also strongest when the model operates in its trained Assistant persona.
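The contrast is easiest to see in the objective itself. Below is the standard DPO loss from Rafailov et al. (2023), not code from this paper: unlike SFT's cross-entropy on a single target response, it trains on an explicit comparison between a chosen and a rejected response.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective. Inputs are per-sequence log-probabilities
    of the chosen and rejected responses under the trainable policy and
    a frozen reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # The contrastive signal SFT lacks: widen the log-probability gap
    # between the preferred and dispreferred response.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```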
Two-stage detection mechanism. The circuit involves two kinds of features, sketched in code after the list:
- Evidence carriers — features in early post-injection layers (~70% depth) that respond monotonically to perturbations along diverse directions. Different steering vectors activate different evidence carriers, so detection is distributed rather than relying on a single anomaly direction.
- Gate features — late-layer features that implement a default "No" response to injection queries. Evidence carriers suppress these gates when a perturbation is detected.
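A purely illustrative sketch of how the two stages compose; every direction below is a random placeholder, not a feature recovered from any model.

```python
import torch

torch.manual_seed(0)
d_model = 512
evidence_dirs = torch.randn(8, d_model)  # stand-ins for evidence-carrier features
gate_dir = torch.randn(d_model)          # stand-in for a late-layer "default No" gate

def reports_injection(resid_mid: torch.Tensor, resid_late: torch.Tensor) -> bool:
    # Stage 1: evidence carriers respond (roughly monotonically) to
    # perturbations of the post-injection residual stream along many
    # distinct directions.
    evidence = torch.relu(evidence_dirs @ resid_mid).sum()
    # Stage 2: the gate feature defaults to answering "No"; accumulated
    # evidence suppresses it, flipping the reported answer to "Yes".
    gate = gate_dir @ resid_late - evidence
    return bool(gate < 0)
```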
The detection mechanism is not reducible to a single linear direction. Although one direction explains substantial variance, the underlying computation is distributed across multiple directions — ruling out the hypothesis that steering vectors merely correlate with a direction promoting affirmative responses.
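A hedged sketch of one way to test distributedness (not necessarily the paper's procedure): project out the single best direction and check whether detection accuracy survives.

```python
import torch

def project_out(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove one direction from a batch of activation vectors."""
    d = direction / direction.norm()
    return acts - (acts @ d).unsqueeze(-1) * d

# Fit the best linear probe direction for "injection detected", ablate it
# with project_out, and re-measure detection accuracy. If accuracy stays
# well above chance, the computation cannot be a single linear direction.
```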
Detection and identification are separable. Detection (noticing that something is injected) and identification (naming what was injected) use largely distinct mechanisms in different layers, with only weak overlap. The model first detects an anomaly, then separately reads out what the anomaly represents.
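A hedged sketch of a layer-wise activation-patching experiment that could expose this split, written against the TransformerLens API; the model, prompts, and scoring step are placeholders, not the paper's setup.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
# Cache a run without injection (both prompts must tokenize to equal length).
_, clean_cache = model.run_with_cache("clean prompt without injection")

for layer in range(model.cfg.n_layers):
    hook_name = f"blocks.{layer}.hook_resid_post"
    # Overwrite one layer of the perturbed run with clean activations.
    patched_logits = model.run_with_hooks(
        "perturbed prompt with an injected vector",
        fwd_hooks=[(hook_name, lambda resid, hook: clean_cache[hook.name])],
    )
    # Score the detection query and the identification query separately here;
    # if the layers that matter for each barely overlap, the two readouts
    # run through distinct mechanisms.
```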
Massively underelicited capability. Refusal mechanisms learned in post-training suppress true detection: ablating refusal directions improves detection from 10.8% to 63.8%, with the false-positive rate rising only modestly (0% to 7.3%). A trained bias vector improves detection by +75% on held-out concepts without increasing false positives. Current models thus possess far more introspective capacity than their behavior reveals; safety training actively suppresses it.
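The ablation itself is the standard directional projection from refusal-direction work; a minimal sketch assuming the difference-in-means construction of Arditi et al. (2024), which may differ from this paper's exact procedure.

```python
import torch

def ablate_refusal(resid: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Project a refusal direction out of the residual stream.

    refusal_dir would typically be estimated as the difference in mean
    activations between refusal-eliciting and neutral prompts, with the
    projection applied at every layer of the forward pass."""
    d = refusal_dir / refusal_dir.norm()
    return resid - (resid @ d).unsqueeze(-1) * d
```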
The safety implications cut both ways: if introspective grounding generalizes, models could be queried directly about their internal states as a complement to external interpretability. But models with genuine introspective awareness could also better detect and conceal misalignment.
Source: MechInterp
Related concepts in this collection
- Can language models detect their own internal anomalies?
  Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing, beyond simply describing their behavior? The answer has implications for both AI transparency and deception. Relation: the Anthropic paper documents the behavioral phenomenon; this paper provides the mechanistic explanation (evidence carriers suppress gate features in a two-stage circuit created by DPO).
- Can language models describe their own learned behaviors?
  Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data. Relation: behavioral self-knowledge (articulating trained policies) uses different mechanisms; this paper shows introspective detection of internal perturbations is mechanistically distinct and emerges specifically from DPO.
- Can auditors discover what hidden objectives a model learned?
  Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved. Relation: if introspective capacity is massively underelicited, models may be capable of detecting their own hidden objectives, raising the bar for what alignment audits need to account for.
- Does optimizing against monitors destroy monitoring itself?
  Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention. Relation: introspective awareness provides a concrete mechanism for monitor-aware obfuscation; models that detect their own internal states can better distinguish monitored from unmonitored conditions.
Original note title: introspective awareness in LLMs emerges specifically from DPO not SFT — a two-stage evidence-carrier and gate circuit detects steering vector injections with zero false positives