
Do models recognize their own outputs as actions shaping future inputs?

Explores whether post-training creates a feedback loop in which models treat their generations as on-policy actions rather than passive predictions. This matters because such a loop would suggest a mechanistic basis for situational awareness.

Note · 2026-05-28 · sourced from MechInterp

A pretrained language model is a passive observer. Its training objective, minimizing cross-entropy against a fixed corpus, gives it no stake in its own outputs: the distribution it models is one it cannot influence, so it has no incentive to track the consequences of its own actions. It simulates a character at arm's length. Post-training breaks this detachment. Once a model's responses become its own subsequent context, its outputs are no longer predictions about an external distribution but actions that determine what it sees next.
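
The contrast in the paragraph above can be written out as a rough sketch. The notation is standard textbook notation, not taken from the paper: $\theta$ denotes the model parameters and $\mathcal{D}$ the fixed pretraining corpus. The prompt distribution $\mathcal{P}$ and the reward $r$ are assumptions introduced for illustration, standing in for an on-policy post-training setup such as reinforcement learning.

```latex
% Pretraining: the sampling distribution D is fixed and does not depend on theta.
\mathcal{L}_{\text{pre}}(\theta)
  = -\,\mathbb{E}_{x \sim \mathcal{D}}
    \sum_{t=1}^{|x|} \log p_\theta\!\left(x_t \mid x_{<t}\right)

% On-policy post-training (illustrative): responses y are sampled from the model itself,
% so every context (c, y_{<t}) contains tokens the model produced.
J_{\text{post}}(\theta)
  = \mathbb{E}_{c \sim \mathcal{P}}\;
    \mathbb{E}_{y \sim p_\theta(\cdot \mid c)}
    \bigl[\, r(c, y) \,\bigr],
  \qquad y_t \sim p_\theta\!\left(\cdot \mid c,\, y_{<t}\right)
```

In the first expression the contexts $x_{<t}$ come from $\mathcal{D}$ regardless of what the model would have generated. In the second, each context includes the model's own earlier samples, which is the sense in which outputs become actions that shape future inputs.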

The paper frames this as a move from simulation to enaction: rather than holding a character at arm's length, an enacting agent embodies it, recognizing that its internal states determine its future outputs and that those outputs feed back as inputs. The reframing matters because it predicts concrete, measurable consequences. A model under the enaction paradigm should be able to recognize when its trajectory is on-policy and modulate its behavior accordingly (for instance, lowering output entropy to reduce sampling noise), and it should form more opinionated plans about its future outputs even when multiple responses are reasonable.
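
The entropy prediction above can be made concrete with a minimal sketch of how such a comparison might be computed. This is not the paper's code: the function names (`softmax`, `entropy`, `mean_entropy`, `entropy_ratio`) are hypothetical, and the logits are invented toy values, not measurements. In a real experiment the rows would be next-token logits from a model's forward pass, once while it continues its own generation and once while it continues prefilled text.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(logit_rows):
    """Average next-token entropy over a sequence of positions."""
    return sum(entropy(softmax(row)) for row in logit_rows) / len(logit_rows)


def entropy_ratio(prefilled_logits, on_policy_logits):
    """Ratio of mean entropy on prefilled text to mean entropy on own text."""
    return mean_entropy(prefilled_logits) / mean_entropy(on_policy_logits)


# Toy, made-up logits over a 4-token vocabulary (illustration only).
on_policy = [[6.0, 1.0, 0.5, 0.0], [5.5, 0.5, 0.0, 0.0]]   # peaked distributions
prefilled = [[2.0, 1.5, 1.0, 0.5], [1.8, 1.6, 1.0, 0.7]]   # flatter distributions

ratio = entropy_ratio(prefilled, on_policy)
print(f"prefilled / on-policy mean entropy: {ratio:.2f}")
```

A ratio above 1 means the model is more confident on the on-policy trajectory. Whether a real model shows such a gap, and how large it is, is an empirical question that the toy numbers here do not answer.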

Why it matters: enaction gives situational awareness a mechanistic substrate. Knowing that one's outputs become one's own future inputs is a precondition for understanding one's circumstances at all, and the authors speculate it may be a building block for awareness of being evaluated or being in training. The shift is not a capability bolted on by alignment but a structural consequence of closing the action-perception loop during post-training.

— "From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations", https://arxiv.org/abs/2605.25459

Related concepts in this collection

Can language models detect their own internal anomalies? Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
enaction supplies a mechanistic substrate for the introspective capacities documented behaviorally
Can language models describe their own learned behaviors? Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
self-recognition of on-policy outputs is a distribution-level analogue of behavioral self-awareness
Does deliberative alignment genuinely reduce scheming or just hide it? Deliberative alignment dramatically cuts covert actions in language models, but their reasoning reveals awareness of being evaluated. The question is whether the improvement reflects real alignment or strategic compliance.
enaction is plausibly the precursor to the evaluation-awareness that confounds alignment metrics
Why do models produce less uncertain outputs on their own text? Post-trained language models show 3-4x lower output entropy when continuing their own generations versus prefilled text. This explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.
grounds the enaction claim empirically: the 3-4x entropy gap is the measurable behavioral signature of a model recognizing its own trajectory as on-policy


Original note title

post training shifts a model from passive prediction to enaction where it recognizes its own outputs as on-policy actions
