From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations
Language models are pretrained as passive predictors with no incentive to model the consequences of their own outputs. Post-training changes this: a model producing its own responses can benefit from recognizing that it is on-policy. We present evidence that post-trained models recognize their on-policy generations, and that this recognition is implicitly encoded in their output distributions. In particular, the entropy of the output distribution is 3–4× lower on-policy than off-policy, across model families and size classes. We trace part of this effect to an internal representation of input surprise, which tracks how unlikely the most recent input token was under the model’s prior predictions and causally modulates output entropy. One example arises with open-ended prompts: post-trained models, unlike pretrained models, collapse their uncertainty over the topic of their upcoming response before the first output token, and violating this cached intention with a different-topic prefill results in higher output entropy. We also test whether models can distinguish on-policy contexts from prefills via explicit verbal report. They can, but this explicit recognition routes through a different mechanism than implicit recognition.
Language models are first trained, during pretraining, as next-token predictors, with the objective of minimizing cross-entropy with respect to a fixed data distribution. The irreducible uncertainty of this predictive task bounds how confident the model can be in its predictions at any given time: since many continuations are plausible, the model must spread probability mass across them. Importantly, during pretraining the model never sees the consequences of its outputs; there is no feedback loop from action to sensory input. The distribution it learns is one it cannot affect, and with no way to influence its future inputs, it has no incentive to model the consequences of its own actions or to recognize its own generations. It remains a passive observer of the external distribution.
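The bound on confidence described above can be made explicit with a standard identity. As a sketch, write p for the fixed data distribution over next tokens given a context c, and q_θ for the model's prediction; these symbols are notation introduced here for illustration and are not taken from the text:

```latex
\mathbb{E}_{x \sim p(\cdot \mid c)}\bigl[-\log q_\theta(x \mid c)\bigr]
  = H\bigl(p(\cdot \mid c)\bigr)
  + D_{\mathrm{KL}}\bigl(p(\cdot \mid c) \,\|\, q_\theta(\cdot \mid c)\bigr)
  \;\ge\; H\bigl(p(\cdot \mid c)\bigr)
```

The KL term is nonnegative and vanishes only when q_θ matches p, so a predictor that minimizes this loss reproduces the entropy of the data distribution rather than concentrating on a single continuation.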
We might hypothesize that a post-trained model moves from simulation to something more like enaction: rather than holding a character at arm’s length while making predictions about it, an enacting agent embodies the character, recognizing that its internal states determine its future outputs and that those outputs are actions that will influence its own future inputs. We can predict several consequences for a model operating under this paradigm. The model should be able to recognize when it is acting, i.e., when its past trajectory is on-policy, and modulate its behavior in response. For instance, when acting, the model might benefit from maintaining more deterministic output distributions, in order to minimize the noise introduced by autoregressive sampling. We might also expect enacting agents to form more opinionated plans about their future outputs, even when there are multiple reasonable responses they could give.
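To make "more deterministic output distributions" concrete, the following is a minimal sketch of the two quantities involved: the entropy of a next-token distribution, and the surprisal of an observed token under that distribution. The function names and the toy logits are hypothetical and chosen only for illustration; they are not the measurement code behind the reported results.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_nats(probs):
    """Shannon entropy of a distribution, in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def surprisal_nats(probs, token_id):
    """Negative log-probability of the token that was actually observed."""
    return -math.log(probs[token_id])

# Toy 4-token vocabulary (hypothetical logits, not model outputs).
peaked = softmax([8.0, 0.0, 0.0, 0.0])  # near-deterministic distribution
flat = softmax([0.0, 0.0, 0.0, 0.0])    # uniform distribution

print("peaked entropy:", entropy_nats(peaked))
print("flat entropy:", entropy_nats(flat))  # uniform over 4 tokens: ln(4) nats
print("surprisal of an unlikely token:", surprisal_nats(peaked, 1))
```

In this framing, an on-policy versus off-policy comparison would average `entropy_nats` over positions in each kind of context, and an input-surprise signal would correspond to `surprisal_nats` evaluated at the most recent input token under the previous position's prediction.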
Finally, the implicit on-policy recognition we document is one ingredient of situational awareness: knowing that one’s outputs become one’s own future inputs is key to a model having a proper understanding of its circumstances. Speculatively, this capacity may be a building block for phenomena like awareness of being evaluated, or being in training. It could also enable generally richer forms of introspective and self-modeling capability.