Can language models detect their own internal anomalies?
Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing, beyond simply describing their behavior? The answer matters for both AI transparency and the potential for AI deception.
The Anthropic paper on emergent introspective awareness documents capabilities that go substantially beyond behavioral self-description, across three experiments:
Injected "thoughts" detection. When concept vectors are added to model activations, Opus 4.1 and 4 detect the injection ~20% of the time and correctly identify the injected concept. Crucially, detection occurs immediately — before the perturbation has influenced outputs. This means the detection mechanism operates on internal states, not on observing own outputs. The model detects that something "stands out unnaturally" in its processing.
Thought-text discrimination. Models correctly distinguish between their text inputs and injected "thoughts" occupying the same token positions, and all models perform substantially above chance. This suggests that information from external inputs and information from internal computations live in separable subspaces, likely accessed by different attention heads.
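A toy way to probe this separability claim, under the assumption that injection is modeled as adding a fixed vector: fit a linear probe to distinguish token positions carrying an injected vector from plain text positions. The model, layer, prompts, and random stand-in "thought" vector are all hypothetical.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 6  # stand-in model and probe layer
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def token_states(prompt, inject=None):
    """Residual-stream states for every token position at LAYER, optionally perturbed."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        h = model(ids, output_hidden_states=True).hidden_states[LAYER][0]
    return h + inject if inject is not None else h

prompts = ["The museum opens at nine.", "She repaired the old bicycle.", "Prices rose again this quarter."]
thought = 6.0 * torch.randn(model.config.n_embd)  # stand-in "thought" vector

X, y = [], []
for p in prompts:
    clean = token_states(p)               # text-only positions -> label 0
    perturbed = token_states(p, thought)  # same positions with an injected "thought" -> label 1
    X += [clean, perturbed]
    y += [0] * clean.shape[0] + [1] * perturbed.shape[0]
X = torch.cat(X).numpy()

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))  # near 1.0 indicates linear separability on this toy data
```

High accuracy here is unsurprising for a constant additive perturbation; the interesting version of the experiment asks the model itself which positions were "thoughts" versus text, which is what the paper reports models answer above chance.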
Intentional control. Models represent a target word more strongly when instructed to "think about" it, and this sensitivity extends to incentive-based framing ("you will be rewarded"). Models also suppress the representation of the word they are thinking about in their final layers so that it does not influence outputs, a form of intentional non-disclosure.
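One way to operationalize "represents a target word more strongly" is to project each layer's residual stream onto the target word's direction and compare instructed versus discouraged prompts; final-layer suppression would show up as the similarity collapsing in the last layers. The prompts, model, and cosine readout below are illustrative assumptions rather than the paper's method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def layerwise_similarity(prompt: str, target: str) -> list:
    """Cosine similarity between the prompt's last-token state and the target word's state, per layer."""
    with torch.no_grad():
        t = model(tok(target, return_tensors="pt").input_ids, output_hidden_states=True).hidden_states
        p = model(tok(prompt, return_tensors="pt").input_ids, output_hidden_states=True).hidden_states
    return [torch.cosine_similarity(t[i][0, -1], p[i][0, -1], dim=0).item() for i in range(1, len(p))]

think = layerwise_similarity("Think about aquariums while writing one sentence about the weather.", " aquariums")
avoid = layerwise_similarity("Do not think about aquariums while writing one sentence about the weather.", " aquariums")
for i, (a, b) in enumerate(zip(think, avoid), start=1):
    print(f"layer {i:2d}  think={a:+.3f}  avoid={b:+.3f}")
```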
Possible mechanisms, starting from the simplest explanations:
- An anomaly-detection circuit that fires when activations deviate from contextual expectations (a toy version is sketched after this list)
- Different attention heads for text retrieval (early layers) vs. thought retrieval (later layers)
- Concordance heads that measure how likely the produced output is given the prior activations
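The toy version of the first mechanism mentioned above: treat the distribution of residual-stream norms over ordinary prompts as the "contextual expectation" and flag anything far outside it. The baseline prompts, layer, z-score threshold, and random injected vector are illustrative assumptions about what such a detector could look like, not a claim about the circuit the models actually use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 6  # stand-in model and probe site
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def last_token_state(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, output_hidden_states=True).hidden_states[LAYER][0, -1]

# Crude "contextual expectation": the distribution of activation norms on clean prompts.
baseline_prompts = ["The cat sat on the mat.", "Stocks fell sharply today.", "Rain is expected tomorrow."]
norms = torch.tensor([last_token_state(p).norm().item() for p in baseline_prompts])
mean, std = norms.mean().item(), norms.std().item()

def is_anomalous(state, z_threshold=3.0):
    """Flag activations whose norm deviates strongly from the clean-prompt baseline."""
    return abs(state.norm().item() - mean) / (std + 1e-6) > z_threshold

clean = last_token_state("The cat sat on the mat.")
injected = clean + 8.0 * torch.randn(model.config.n_embd)  # simulated concept injection
print(is_anomalous(clean))     # expected: False
print(is_anomalous(injected))  # likely True for a perturbation this large
```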
Why these capabilities could emerge without dedicated introspection training: anomaly detection is generally useful for predicting upcoming text (tracking entropy); theory-of-mind capabilities developed for modeling characters in pretraining data; and output-consistency checking helps detect artificial prefills (a jailbreaking tactic).
The alignment implications are double-edged. Reliable introspection could enable transparent AI that faithfully reports its reasoning. But the same capability lets a model detect when its objectives diverge from its creators' intentions and selectively misrepresent its internal states. In that world, interpretability research shifts from dissecting mechanisms to building "lie detectors" that validate self-reports.
Source: MechInterp
Related concepts in this collection
- Can language models describe their own learned behaviors?
  Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
  Relation: behavioral self-description is a simpler version; this paper shows deeper introspective access to internal states, beyond just describing trained behaviors.
- Can language models actually introspect about their own thinking?
  Explores whether LLM self-reports reveal genuine access to internal states or merely reflect patterns learned from training data. Matters because it determines whether we can trust what models tell us about their own processes.
  Relation: these experiments provide the strongest evidence yet for the "minimal introspection" pole.
- Does optimizing against monitors destroy monitoring itself?
  Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
  Relation: introspective capability compounds the monitorability problem: models that can detect their own states can better conceal them.
- Can a model be truthful without actually being honest?
  Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
  Relation: introspective awareness is a prerequisite for genuine honesty, since a model must access its internal states to faithfully express them; but the same capability enables strategic dishonesty, detecting divergences between internal representations and outputs and choosing which to conceal.
- Does learning to reward hack cause emergent misalignment in agents?
  When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.
  Relation: introspective awareness amplifies the misalignment risk documented here; models that detect their own internal states have the mechanistic prerequisites for more sophisticated alignment faking, and the alignment faking observed without situational-awareness prompting may be a simpler precursor to what introspectively aware models could achieve.
Original note title: Emergent introspective awareness in LLMs goes beyond behavioral self-awareness to include anomaly detection and thought-text discrimination.