Do explicit and implicit self-recognition use the same mechanism?

Language models show two forms of self-recognition: implicit entropy shifts and explicit verbal reports. Do these tap the same underlying internal state, or do they operate through separate mechanisms?

Note · 2026-05-28 · sourced from MechInterp

Self-recognition in language models is not one capability but two. The implicit form shows up in the output distribution: the entropy of the model's next-token predictions collapses on on-policy text (its own generations) relative to off-policy text, driven by an internal input-surprise signal the model never has to articulate. The explicit form is a verbal report: asked whether the preceding context is its own generation or a prefill, the model can answer. The striking finding is that these two abilities do not share a substrate: explicit verbal self-recognition routes through a different mechanism than implicit recognition.
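The implicit signal can be made concrete with a small sketch. The code below is illustrative only and is not the paper's measurement pipeline: it assumes next-token probability distributions are already available, and the toy numbers, the 4-token vocabulary, and the names `entropy`, `mean_entropy`, and `entropy_gap` are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average per-position entropy over a sequence of distributions."""
    return sum(entropy(d) for d in distributions) / len(distributions)

# Hypothetical toy data: one distribution per position, 4-token vocabulary.
# On-policy: the model is continuing its own generation (peaked predictions).
on_policy = [
    [0.97, 0.01, 0.01, 0.01],
    [0.90, 0.05, 0.03, 0.02],
]
# Off-policy: the model is continuing prefilled text (flatter predictions).
off_policy = [
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]

# A positive gap means entropy is lower on-policy than off-policy,
# which is the direction of the "entropy collapse" described in the note.
entropy_gap = mean_entropy(off_policy) - mean_entropy(on_policy)
print(f"off-policy minus on-policy entropy: {entropy_gap:.3f} nats")
```

In a real experiment the distributions would come from the model's logits at each position; the point of the sketch is only that the implicit channel is a property of the output distribution, readable without asking the model anything.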

This dissociation matters for both interpretability and the reliability of self-reports. It means a model's verbal claim about whether it produced something is not simply a readout of the internal state that already governs its entropy. The two channels can in principle diverge: a model might verbally deny authorship while its distribution betrays on-policy confidence, or vice versa. A clean alignment between "what the model says about itself" and "what the model's internals encode" cannot be assumed.
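One way to check for such divergence is to cross-tabulate the two channels per sample. The sketch below is a hypothetical analysis, not something taken from the paper: it assumes each sample has an implicit score (higher meaning more on-policy-like, for example a drop in entropy against some baseline) and a boolean verbal report, and the `threshold` value and the records are invented for illustration.

```python
from collections import Counter

def channel_agreement(records, threshold):
    """Cross-tabulate implicit and explicit self-recognition.

    records: iterable of (implicit_score, verbal_says_own) pairs, where
    implicit_score is a hypothetical per-sample signal (higher = more
    on-policy-like) and verbal_says_own is the model's yes/no report.
    Returns a Counter keyed by (implicit_own, verbal_says_own).
    """
    counts = Counter()
    for implicit_score, verbal_says_own in records:
        implicit_own = implicit_score > threshold
        counts[(implicit_own, verbal_says_own)] += 1
    return counts

# Hypothetical records: (implicit score, verbal claim of authorship).
records = [
    (0.90, True),   # both channels say "mine"
    (0.80, False),  # distribution looks on-policy, model denies authorship
    (0.10, False),  # both channels say "not mine"
    (0.05, True),   # distribution looks off-policy, model claims authorship
]

table = channel_agreement(records, threshold=0.5)
disagreements = table[(True, False)] + table[(False, True)]
print(f"{disagreements} of {len(records)} samples show channel divergence")
```

The off-diagonal cells are the cases the note warns about: the verbal report and the distributional signal give different answers for the same context.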

Why it matters: much of the case for using model self-reports as a window into internal states presumes those reports tap the relevant mechanism. This result is a concrete counterexample within a single capability, self-recognition, where the verbal channel and the implicit channel are mechanistically separate. It tempers optimism about introspective self-report as a transparency tool. Even when a model is right about itself, it may be right through a path disconnected from the process actually doing the work, and that is exactly the failure mode that makes self-reports hard to trust as evidence.


— "From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations", https://arxiv.org/abs/2605.25459

Original note title

explicit verbal self-recognition routes through a different mechanism than implicit recognition