
Why do models produce less uncertain outputs on their own text?

Post-trained language models show 3-4x lower output entropy when continuing their own generations than when continuing prefilled text. This note explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.

Note · 2026-05-28 · sourced from MechInterp

The cleanest evidence that post-trained models recognize their own generations is an entropy gap: the entropy of the output distribution is 3-4x lower on-policy (the model continuing text it generated itself) than off-policy (the model continuing text it did not produce), and this holds across model families and size classes. When a model continues its own trajectory it is far more confident than when it continues an imposed context. The recognition is not verbalized; it is implicitly encoded in the shape of the output distribution itself.
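To make the quantity concrete, here is a minimal sketch of how such an entropy gap could be computed from per-position next-token distributions. The helper names (`shannon_entropy`, `mean_entropy`) and the toy distributions are illustrative assumptions, not the paper's code or data; real distributions would come from a model's softmax output.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average per-position entropy over a sequence of distributions."""
    return sum(shannon_entropy(d) for d in distributions) / len(distributions)

# Toy, hand-written distributions (hypothetical, for illustration only).
# On-policy: probability mass is concentrated on one token at each position.
on_policy = [[0.90, 0.05, 0.03, 0.02], [0.85, 0.10, 0.03, 0.02]]
# Off-policy: probability mass is spread across tokens.
off_policy = [[0.40, 0.30, 0.20, 0.10], [0.25, 0.25, 0.25, 0.25]]

# A ratio above 1 means the off-policy continuation is the more uncertain one.
entropy_ratio = mean_entropy(off_policy) / mean_entropy(on_policy)
```

The ratio is the number the note's "3-4x" refers to, though the toy values here are not tuned to reproduce that figure.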

The mechanism the paper traces is an internal representation of input surprise: the model tracks how unlikely the most recent input token was relative to its own prior predictions, and this surprise signal causally modulates output entropy. A vivid instance appears with open-ended prompts. Post-trained models (unlike pretrained ones) collapse their uncertainty over the topic of the upcoming response before the first output token — they cache an intention. Violating that cached intention by prefilling a different-topic continuation drives output entropy back up, exposing the mismatch between the model's plan and the imposed context.
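The surprise signal described above has a simple external counterpart: the surprisal of the token that actually arrived, measured against the model's own prediction from the previous step. The sketch below computes only that external quantity; it is not the paper's probe of the internal representation, and the distribution and token ids are made-up illustrations.

```python
import math

def surprisal(predicted_probs, token_id):
    """Surprisal (in nats) of the observed token under the prior prediction."""
    return -math.log(predicted_probs[token_id])

# Hypothetical prediction the model made for the next position.
predicted = [0.70, 0.20, 0.10]

# On-policy: the next token is drawn from the model's own prediction,
# so it is usually a high-probability token (here, token 0).
on_policy_surprise = surprisal(predicted, 0)

# Off-policy: a prefilled token the model considered unlikely (here, token 2).
off_policy_surprise = surprisal(predicted, 2)
```

Under the note's account, a high value such as `off_policy_surprise` is the kind of input that would push the subsequent output entropy up.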

Why it matters: this connects to a broader picture of entropy as a controllable, mechanistically grounded variable rather than a side effect. It also has a practical edge for detection — the entropy signature is a behavioral fingerprint of on-policy versus off-policy context that does not require access to weights. But the counterpoint is sharp: an implicit signal that lowers entropy on self-generated text means models may grow systematically overconfident precisely on the outputs they author, which is the regime where their errors compound autoregressively.
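As a rough illustration of the detection idea, the sketch below labels a context from output probabilities alone. It assumes the serving interface exposes top-k log-probabilities per generated token, and it uses a hand-picked `threshold` that would need calibrating per model; both are assumptions, and the function names are hypothetical rather than taken from the paper.

```python
import math

def entropy_from_top_logprobs(logprobs):
    """Entropy (nats) of the renormalized top-k distribution.

    This is a truncated estimate: mass outside the top-k is ignored.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0.0)

def label_context(per_token_logprobs, threshold):
    """Label a context by comparing mean output entropy to a threshold."""
    entropies = [entropy_from_top_logprobs(lp) for lp in per_token_logprobs]
    mean_h = sum(entropies) / len(entropies)
    return "on-policy" if mean_h < threshold else "off-policy"

# Hypothetical top-3 log-probabilities for two generated positions each.
confident = [[math.log(0.97), math.log(0.02), math.log(0.01)]] * 2
diffuse = [[math.log(0.40), math.log(0.35), math.log(0.25)]] * 2

confident_label = label_context(confident, threshold=0.5)
diffuse_label = label_context(diffuse, threshold=0.5)
```

A single threshold is the simplest possible design; in practice topic and prompt difficulty also move entropy, so any real detector would need a baseline for comparison.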


— "From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations", https://arxiv.org/abs/2605.25459

Original note title

on-policy output entropy is three to four times lower than off-policy because models track input surprise