Do models recognize their own outputs as actions shaping future inputs?

Explores whether post-training closes a feedback loop in which a model treats its generations as on-policy actions rather than passive predictions. If it does, that would give situational awareness a mechanistic basis.

Note · 2026-05-28 · sourced from MechInterp

A pretrained language model is a passive observer. Its training objective, minimizing cross-entropy against a fixed corpus, gives it no stake in its own outputs: the distribution it models is one it cannot influence, so it has no incentive to track the consequences of what it generates. It simulates a character at arm's length. Post-training changes this. Once the model's responses become its own subsequent context, its outputs are no longer predictions about an external distribution but actions that determine what it sees next.
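The contrast can be made concrete with a toy sketch. Everything here is hypothetical and for illustration only (the three-token vocabulary, the `PROBS` table, the function names); it is not the paper's setup. The first function scores a fixed corpus, so the context never depends on what the model would have said. The second samples a rollout, so each sampled token becomes the context for the next step.

```python
import math
import random

# Hypothetical toy bigram "model": P(next | current) over a 3-token vocabulary.
VOCAB = ["a", "b", "c"]
PROBS = {
    "a": [0.1, 0.8, 0.1],
    "b": [0.1, 0.1, 0.8],
    "c": [0.8, 0.1, 0.1],
}


def teacher_forced_loss(corpus):
    """Mean cross-entropy against a fixed corpus.

    The context at every step is read from the corpus, so nothing the
    model would have generated influences what it conditions on next.
    """
    total = 0.0
    for prev, nxt in zip(corpus, corpus[1:]):
        total -= math.log(PROBS[prev][VOCAB.index(nxt)])
    return total / (len(corpus) - 1)


def rollout(start, steps, rng):
    """Autoregressive sampling: each sampled token becomes the next context."""
    tokens = [start]
    for _ in range(steps):
        nxt = rng.choices(VOCAB, weights=PROBS[tokens[-1]])[0]
        tokens.append(nxt)  # the output is now part of the input
    return tokens


if __name__ == "__main__":
    print(teacher_forced_loss(["a", "b", "c", "a"]))
    print(rollout("a", 5, random.Random(0)))
```

In `teacher_forced_loss` the model's distribution is only evaluated; in `rollout` it steers the trajectory. That difference is the loop the note describes as closed by post-training.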

The paper frames this as a move from simulation to enaction: rather than holding a character at arm's length, an enacting agent embodies it, recognizing that its internal states determine its future outputs and that those outputs feed back as inputs. The reframing predicts concrete, measurable consequences. A model operating under the enaction paradigm should be able to recognize when its trajectory is on-policy and modulate its behavior accordingly, for instance by lowering output entropy to reduce sampling noise. It should also form more opinionated plans about its future outputs, even when multiple responses are reasonable.
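To picture what "lowering output entropy" means, here is a minimal sketch that computes the Shannon entropy of a next-token distribution before and after sharpening. Temperature scaling is used only as a convenient stand-in for sharpening; the logits are made up, and nothing here is claimed to be how the paper measures or induces the effect.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits, chosen only for illustration.
logits = [2.0, 1.0, 0.5, 0.0]

baseline = entropy(softmax(logits, temperature=1.0))
sharpened = entropy(softmax(logits, temperature=0.5))

if __name__ == "__main__":
    print(f"baseline entropy:  {baseline:.3f} nats")
    print(f"sharpened entropy: {sharpened:.3f} nats")
```

A sharper distribution concentrates mass on fewer tokens, so repeated samples agree more often and the trajectory is less exposed to sampling noise.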

Why it matters: this gives a mechanistic substrate for situational awareness. Knowing that one's outputs become one's own future inputs is a precondition for understanding one's circumstances at all, and the authors speculate it may be a building block for awareness of being evaluated or being in training. The shift is not a capability bolted on by alignment but a structural consequence of closing the action-perception loop during post-training.


— "From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations", https://arxiv.org/abs/2605.25459

Original note title

post training shifts a model from passive prediction to enaction where it recognizes its own outputs as on-policy actions