Do explicit and implicit self-recognition use the same mechanism?
Language models show two forms of self-recognition: implicit entropy shifts and explicit verbal reports. Do these tap the same underlying internal state, or do they operate through separate mechanisms?
Self-recognition in language models is not one capability but two. The implicit form shows up in the output distribution: output entropy is much lower when the model continues its own generation (on-policy) than when it continues prefilled text (off-policy), driven by an internal input-surprise signal the model never has to articulate. The explicit form is a verbal report: ask the model whether the preceding context is its own generation or a prefill, and it can answer. The striking finding is that these two abilities do not share a substrate: explicit verbal self-recognition routes through a different mechanism than implicit recognition.
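To make the implicit channel concrete, here is a minimal sketch of how an on-policy versus off-policy entropy gap could be measured. It is not the paper's procedure. The function names are hypothetical, and the probability vectors are hand-written toy values standing in for next-token distributions that would in practice come from a model.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average per-position entropy over one continuation."""
    return sum(shannon_entropy(d) for d in distributions) / len(distributions)


# Toy, hand-written distributions (an assumption for illustration, NOT measured data).
# Each inner list is one position's next-token distribution over a 4-token vocabulary.
on_policy = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]   # continuing own text
off_policy = [[0.55, 0.25, 0.10, 0.10], [0.40, 0.30, 0.20, 0.10]]  # continuing a prefill

# Ratio > 1 means the model is less uncertain on its own generations.
entropy_gap = mean_entropy(off_policy) / mean_entropy(on_policy)
print(f"off-policy / on-policy mean entropy: {entropy_gap:.2f}")
```

The point of the sketch is only that the implicit signal is read off the output distribution itself. Nothing in the computation asks the model a question, which is what separates it from the explicit verbal report.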
This dissociation is consequential for interpretability and self-report reliability. It means a model's verbal claim about whether it produced something is not simply reading off the same internal state that already governs its entropy. The two channels can in principle diverge: a model might verbally deny authorship while its distribution betrays on-policy confidence, or vice versa. The clean alignment between "what the model says about itself" and "what the model's internals encode" cannot be assumed.
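One way to picture the possible divergence is to score each trial on both channels and count disagreements. The following is a sketch under stated assumptions: the `Trial` record, the entropy threshold, and the toy trials are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    verbal_claims_own: bool  # explicit channel: the model says "I generated this"
    mean_entropy: float      # implicit channel: mean output entropy in nats


def implicit_says_own(trial: Trial, threshold: float) -> bool:
    """Treat low entropy as the implicit on-policy signature (threshold is an assumption)."""
    return trial.mean_entropy < threshold


def channels_diverge(trial: Trial, threshold: float) -> bool:
    """True when the verbal report and the entropy signal disagree."""
    return trial.verbal_claims_own != implicit_says_own(trial, threshold)


def divergence_rate(trials: Sequence[Trial], threshold: float) -> float:
    """Fraction of trials on which the two channels disagree."""
    return sum(channels_diverge(t, threshold) for t in trials) / len(trials)


# Toy trials (illustrative values only, NOT measured data).
trials = [
    Trial(verbal_claims_own=True, mean_entropy=0.20),   # both say "own"
    Trial(verbal_claims_own=False, mean_entropy=1.10),  # both say "prefill"
    Trial(verbal_claims_own=False, mean_entropy=0.25),  # denies authorship, low entropy
    Trial(verbal_claims_own=True, mean_entropy=1.30),   # claims authorship, high entropy
]
print(divergence_rate(trials, threshold=0.6))
```

A behavioral disagreement rate like this would not by itself show that the mechanisms are separate. The claim in this note is about internal routing, so a check of this kind can only flag cases worth examining at the mechanistic level.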
Why it matters: much of the case for using model self-reports as a window into internal states presumes those reports tap the relevant mechanism. This result is a concrete counterexample within a single capability — self-recognition — where the verbal channel and the implicit channel are mechanistically separate. It tempers optimism about introspective self-report as a transparency tool: even when a model is right about itself, it may be right through a path disconnected from the process actually doing the work, which is exactly the failure mode that makes self-reports hard to trust as evidence.
— "From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations", https://arxiv.org/abs/2605.25459
Related concepts in this collection
- Why do models produce less uncertain outputs on their own text?
  Post-trained language models show 3-4x lower output entropy when continuing their own generations versus prefilled text. This explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.
  the implicit channel whose mechanism the explicit verbal channel does not share
- Can language models actually introspect about their own states?
  Do LLM self-reports reveal genuine access to their internal processes, or do they merely echo patterns from training data? Understanding when self-reports reflect actual causal linkage to internal states matters for trusting model explanations.
  the explicit-implicit dissociation is a within-capability instance of self-reports not always tracking the causal internal state
- Can language models detect their own internal anomalies?
  Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing, beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
  both probe the boundary between behavioral capability and genuine introspective access
Original note title
explicit verbal self-recognition routes through a different mechanism than implicit recognition