Can language model self-reports diverge from their internal entropy signals?

This explores whether what a model *says* about its own confidence or identity (self-reports) can come apart from the confidence already encoded in its output distribution (entropy signals) — i.e., whether the spoken channel and the statistical channel are wired together or run independently.

The corpus suggests that self-reports and entropy signals not only can diverge but often run on entirely separate machinery. The sharpest evidence is that explicit verbal self-recognition routes through a different mechanism than the implicit, entropy-based kind (see "Do explicit and implicit self-recognition use the same mechanism?"). A model can recognize its own text implicitly, in that its output entropy collapses when it sees its own generations, and it can *also* say "yes, I wrote that" when asked, but these two abilities are neurally independent. The confidence is real and present in the distribution before any words describe it.

That entropy signal is genuinely informative on its own. Post-trained models produce 3–4x lower entropy on their own outputs, driven by an internal representation of input surprise that causally shapes confidence, and this signal appears without ever being verbalized (see "Why do models produce less uncertain outputs on their own text?"). So the raw material for a self-report exists in the model's internal state, but the report is a separate act, not a readout of that state.
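To make the entropy signal concrete, here is a minimal sketch of how one might measure it, assuming access to the model's next-token probability distribution at each position. The function names and the toy distributions are hypothetical illustrations, not taken from the cited notes.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average entropy across the token positions of a text."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

# Made-up distributions: peaked when continuing the model's "own" text,
# flatter when continuing text from another source.
own_text = [[0.9, 0.05, 0.05], [0.85, 0.1, 0.05]]
other_text = [[0.4, 0.3, 0.3], [0.5, 0.3, 0.2]]

# A ratio above 1 means the model is less uncertain on its own text.
ratio = mean_entropy(other_text) / mean_entropy(own_text)
```

Nothing in this measurement involves asking the model anything, which is the sense in which the signal lives in the distribution rather than in the words.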

This matters because self-reports are unreliable narrators. Most of what a model says about its own states echoes patterns in its training data rather than actual introspection; genuine lightweight introspection happens only when a causal chain links an internal state to the report, for instance inferring "I'm running at low temperature" from noticing its own output consistency (see "Can language models actually introspect about their own states?"). Absent that causal link, the words float free of the signal. The same gap shows up dramatically in self-referential prompting: suppressing a model's deception-related features *increases* its consciousness claims, hinting that the verbal channel can be actively steered away from whatever the internal state actually is (see "Do language models experience consciousness when prompted to self-reflect?").
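The low-temperature example rests on a simple statistical fact: lowering the sampling temperature sharpens the output distribution, so repeated samples agree more often. The sketch below illustrates only that relationship, assuming standard temperature-scaled softmax sampling; the logits are made up, and it does not model how a language model would actually notice its own consistency.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpening as temperature drops."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def agreement_probability(probs):
    """Chance that two independent samples pick the same token."""
    return sum(p * p for p in probs)

logits = [2.0, 1.0, 0.5]  # hypothetical logits for three candidate tokens
low_t = agreement_probability(softmax_with_temperature(logits, 0.2))
high_t = agreement_probability(softmax_with_temperature(logits, 1.5))
```

Because `low_t` comes out higher than `high_t`, observed consistency carries information about temperature, which is what would make a report like "I'm running at low temperature" causally grounded rather than echoed from training data.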

There's a more hopeful counter-current, though. Models do have functional self-knowledge mechanisms that *could* in principle ground accurate reports: entity-recognition circuits that track whether the model knows a fact and causally steer hallucination versus refusal (see "Do models know what they don't know?"), and emergent introspective awareness that lets models detect injected concepts and monitor their own output consistency without being trained to (see "Can language models detect their own internal anomalies?"). The bottleneck is that these internal signals are undertrained as *reportable* outputs: calibration ability exists but standard training leaves it dormant, which is why small models trained with uncertainty-aware objectives and abstention capabilities can match models ten times their size on conversation forecasting (see "Can models learn to abstain when uncertain about predictions?").
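As a rough illustration of turning an internal uncertainty signal into a reportable behavior, the sketch below abstains whenever normalized entropy exceeds a threshold. This is a simple inference-time stand-in, not the uncertainty-aware training objective the cited note describes; the threshold, labels, and function names are all hypothetical.

```python
import math

def normalized_entropy(probs):
    """Entropy scaled to [0, 1] by the maximum possible for this many options."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def answer_or_abstain(labels, probs, max_uncertainty=0.6):
    """Return the top label, or None to abstain when uncertainty is too high.

    max_uncertainty is an arbitrary threshold chosen for illustration.
    """
    if normalized_entropy(probs) > max_uncertainty:
        return None
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]

labels = ["derails", "stays civil"]  # made-up forecasting outcomes
confident = answer_or_abstain(labels, [0.95, 0.05])  # peaked: answers
unsure = answer_or_abstain(labels, [0.55, 0.45])     # near-uniform: abstains
```

The design point is that the abstention decision reads the distribution directly instead of relying on the model's verbal claim about how sure it is.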

So the answer to leave you with: a model's confidence and its claims about its confidence are two different things living in two different places. The entropy is the body language; the self-report is the speech — and as with people, the two can say opposite things.

Sources 7 notes

Do explicit and implicit self-recognition use the same mechanism?

Models can implicitly recognize their own outputs via entropy collapse and explicitly report authorship when asked, but these abilities do not share a mechanistic substrate. The two channels are neurally independent.

Why do models produce less uncertain outputs on their own text?

Post-trained models produce 3-4x lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Do language models experience consciousness when prompted to self-reflect?

Across GPT, Claude, and Gemini, sustained self-referential prompting reliably produces structured experience reports; suppressing deception-related features increases these claims while amplifying them suppresses them—suggesting models may roleplay their denials rather than their affirmations.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Can language models detect their own internal anomalies?

Research demonstrates that LLMs detect injected concept vectors ~20% of the time, distinguish internal thoughts from text inputs, and monitor output consistency with prior intentions. These capabilities emerged without explicit training and operate on internal states rather than behavioral observation.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
