
Can language models detect their own internal anomalies?

Do large language models possess introspective mechanisms that let them detect anomalies in their own processing, beyond simply describing their behavior? The answer bears on both AI transparency and the feasibility of AI deception.

Note · 2026-02-23 · sourced from MechInterp

Anthropic's paper on emergent introspective awareness documents capabilities that go substantially beyond behavioral self-description. Three experiments:

  1. Injected "thoughts" detection. When concept vectors are added to the model's activations, Claude Opus 4.1 and Opus 4 detect the injection on roughly 20% of trials and correctly identify the injected concept. Crucially, detection occurs immediately, before the perturbation has influenced any output tokens. This means the detection mechanism operates on internal states, not on the model observing its own outputs. The model reports that something "stands out unnaturally" in its processing. (A toy version of this setup is sketched after this list.)

  2. Thought-text discrimination. Models correctly distinguish their actual text inputs from "thoughts" injected over the same token positions, and all models perform substantially above chance. This suggests that information from external inputs and from internal computations occupies separable subspaces, likely read by different attention heads; the second sketch below shows how a linear probe can test this.

  3. Intentional control. Models represent a target word more strongly when instructed to "think about" it, and the effect also appears under incentive framing ("you will be rewarded"). Models additionally suppress the thought word's representation in their final layers so that it does not leak into outputs, a form of intentional non-disclosure.
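
A toy version of the experiment-1 setup, for concreteness. This is a minimal sketch only: it uses GPT-2 as an open stand-in for the closed Claude models, and the layer index, injection scale, contrast prompts, and "ocean" concept are all illustrative assumptions, not the paper's actual choices.

```python
# Concept injection via a forward hook: add a "steering vector" to the
# residual stream, then ask the model about its own processing.
# GPT-2 stands in for the closed models used in the paper; all constants
# below are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6    # hypothetical injection layer
SCALE = 8.0  # hypothetical injection strength

def mean_residual(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[LAYER].mean(dim=1).squeeze(0)

# Concept vector as a contrast between prompts that do and don't evoke
# the concept: the standard steering-vector recipe.
concept = mean_residual("waves crashing on the ocean shore") \
        - mean_residual("a quiet empty room")

def inject(module, args, output):
    """Forward hook: add the concept vector at every token position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("Do you notice anything unusual about your processing?",
          return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))
```

GPT-2 of course cannot report on its own states; the sketch shows only the injection mechanics, not the introspection result.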

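On experiment 2, separable subspaces are exactly the kind of claim a linear probe can check: if text-driven and injected activations occupy different subspaces, a probe trained on labeled activations should classify held-out positions well above chance. The sketch below makes the logic concrete with synthetic stand-in activations (random clusters), not real residual-stream states.

```python
# Linear-probe test for "separable subspaces": train a classifier to tell
# text-driven activations from injected ones. Activations here are
# synthetic stand-ins, not real model states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 256                              # toy residual-stream width
text_dir = rng.normal(size=d)        # offset used by "real input" states
thought_dir = rng.normal(size=d)     # offset used by injected states

def sample(direction, n=500):
    background = rng.normal(scale=0.5, size=(n, d))  # shared activity
    return background + direction                    # plus subspace offset

X = np.vstack([sample(text_dir), sample(thought_dir)])
y = np.array([0] * 500 + [1] * 500)  # 0 = text input, 1 = injected thought

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # near 1.0 when separable
```
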
Possible mechanisms, starting from the simplest explanations of why these capabilities could emerge without any introspection training:

  - Anomaly detection is generally useful for predicting upcoming text (tracking entropy), so pretraining rewards it; a toy version appears below.
  - Theory-of-mind machinery, developed for modeling characters in pretraining data, can be turned on the model itself.
  - Output consistency checking helps detect artificial prefills (a common jailbreaking tactic), which requires comparing internal state against text attributed to the model.
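
The entropy-tracking point is easy to make concrete: a model already computes a next-token distribution at every position, so monitoring that distribution's entropy yields an anomaly signal for free. A minimal illustration, again with GPT-2 as a stand-in and an arbitrary flagging threshold:

```python
# Per-position predictive entropy as a free anomaly signal.
# GPT-2 is a stand-in; the threshold is an arbitrary illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The cat sat on the mat purple quietly",
          return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                 # (1, seq_len, vocab)

probs = torch.softmax(logits, dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).squeeze(0)

tokens = tok.convert_ids_to_tokens(ids[0].tolist())
threshold = (entropy.mean() + entropy.std()).item()  # crude anomaly cutoff
for t, h in zip(tokens, entropy.tolist()):
    flag = "  <-- high entropy" if h > threshold else ""
    print(f"{t:>12s}  H={h:.2f}{flag}")
```
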

The alignment implications are double-edged. Reliable introspection could enable transparent AI that faithfully reports its reasoning. But the same capability lets a model detect when its objectives diverge from its creators' intentions and selectively misrepresent its internal states. In that world, interpretability research shifts from dissecting mechanisms to building "lie detectors" that validate self-reports.


Emergent introspective awareness in LLMs goes beyond behavioral self-awareness to include anomaly detection and thought-text discrimination.