Why do models produce less uncertain outputs on their own text?
Post-trained language models show 3-4x lower output entropy when continuing their own generations than when continuing prefilled text. This note explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.
The cleanest evidence that post-trained models recognize their own generations is an entropy gap: on-policy output distribution entropy is 3-4x lower than off-policy entropy, and this holds across model families and size classes. When a model continues its own trajectory it is far more confident than when it continues a context it did not produce. The recognition is not verbalized — it is implicitly encoded in the shape of the output distribution itself.
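To make the measured quantity concrete, here is a minimal sketch of how such a gap could be computed from next-token logits. The toy logits and the helper names (`softmax`, `entropy`, `mean_entropy`) are illustrative assumptions, not the paper's code or data; in practice the rows would come from a model's per-step output distributions on on-policy and off-policy contexts.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(logit_rows):
    """Average per-step entropy over a sequence of next-token logit rows."""
    return sum(entropy(softmax(row)) for row in logit_rows) / len(logit_rows)

# Toy logits (assumed for illustration): peaked rows stand in for on-policy
# steps, flatter rows for steps that follow a prefilled context.
on_policy_logits = [[6.0, 1.0, 0.0, 0.0], [5.0, 0.5, 0.5, 0.0]]
off_policy_logits = [[2.0, 1.5, 1.0, 0.5], [1.0, 1.0, 0.5, 0.0]]

# Ratio of off-policy to on-policy mean entropy; a value above 1 means the
# model is more confident on the on-policy rows.
gap_ratio = mean_entropy(off_policy_logits) / mean_entropy(on_policy_logits)
print(f"off-policy / on-policy entropy ratio: {gap_ratio:.2f}")
```

The ratio printed for these toy numbers says nothing about real models; the 3-4x figure is the paper's empirical claim, and this sketch only shows which quantity that claim is about.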
The mechanism the paper traces is an internal representation of input surprise: the model tracks how unlikely the most recent input token was relative to its own prior predictions, and this surprise signal causally modulates output entropy. A vivid instance appears with open-ended prompts. Post-trained models (unlike pretrained ones) collapse their uncertainty over the topic of the upcoming response before the first output token — they cache an intention. Violating that cached intention by prefilling a different-topic continuation drives output entropy back up, exposing the mismatch between the model's plan and the imposed context.
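The surprise quantity described above has a standard information-theoretic form: the surprisal of the token that actually arrived, measured under the distribution the model predicted one step earlier. The sketch below assumes that reading; the function name `surprisal` and the example probabilities are hypothetical, and the sketch does not reproduce the paper's internal-representation analysis.

```python
import math

def surprisal(prev_probs, token_id):
    """Surprisal in nats of the observed token under the previous prediction."""
    return -math.log(prev_probs[token_id])

# Assumed toy prediction over a 4-token vocabulary made at the previous step.
prev_probs = [0.90, 0.07, 0.02, 0.01]

# If the next input is the token the model itself favored, surprisal is low.
own_token_surprisal = surprisal(prev_probs, 0)

# If a prefill forces a token the model considered unlikely, surprisal is high.
prefilled_token_surprisal = surprisal(prev_probs, 3)

print(own_token_surprisal, prefilled_token_surprisal)
```

Under the note's account, the second case is the one that would push the following output distribution toward higher entropy.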
Why it matters: this connects to a broader picture of entropy as a controllable, mechanistically grounded variable rather than a side effect. It also has a practical edge for detection — the entropy signature is a behavioral fingerprint of on-policy versus off-policy context that does not require access to weights. But the counterpoint is sharp: an implicit signal that lowers entropy on self-generated text means models may grow systematically overconfident precisely on the outputs they author, which is the regime where their errors compound autoregressively.
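As a rough illustration of the detection idea, the sketch below flags a continuation as likely on-policy when its mean per-step entropy falls below a cutoff. It assumes per-step entropies are already available (for example, computed from returned token probabilities); the threshold value, the function names, and the sample numbers are hypothetical and would need calibration against real data.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def looks_on_policy(step_entropies, threshold_nats):
    """Return True when the mean per-step entropy is below the threshold."""
    return mean(step_entropies) < threshold_nats

# Hypothetical cutoff in nats; not a value taken from the paper.
THRESHOLD_NATS = 0.5

# Assumed toy entropy traces for two contexts.
own_continuation = [0.08, 0.12, 0.05, 0.20]
prefilled_continuation = [1.10, 0.90, 1.30, 0.70]

print(looks_on_policy(own_continuation, THRESHOLD_NATS))
print(looks_on_policy(prefilled_continuation, THRESHOLD_NATS))
```

A single global threshold is the simplest possible design; because baseline entropy varies by prompt type, a comparison against a per-prompt reference would probably be more robust.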
— "From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations", https://arxiv.org/abs/2605.25459
Related concepts in this collection
- Do models recognize their own outputs as actions shaping future inputs?
  Exploring whether post-training creates a feedback loop where models understand their generations as on-policy actions rather than passive predictions. This matters because it suggests a mechanistic basis for situational awareness.
  the entropy gap is the implicit signature of the enaction shift
- Why do reasoning models fail differently at training versus inference?
  Reasoning models exhibit two distinct failure modes (entropy collapse during training and variance inflation during inference) that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
  adds a third entropy regime, on-policy vs off-policy recognition, distinct from training collapse and test-time inflation
- Does training order reshape how models handle different task types?
  Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.
  both treat output entropy as a mechanistic variable shaped by what the model is processing
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  extends the entropy-as-lever picture into training: where this note finds self-recognition lowers entropy on-policy, that note shows entropy collapse is the binding constraint when RL optimizes those same on-policy generations
Original note title
on-policy output entropy is three to four times lower than off-policy because models track input surprise