Does input surprise drive the implicit recognition of on-policy context?
This explores whether a model's sense of being 'on-policy' — recognizing that the text it's reading is its own generated trajectory rather than external input — is triggered by how predictable (low-surprise) that text feels to it.
The corpus suggests the answer is roughly yes, with an interesting mechanism behind it. The clearest evidence comes from work showing that post-training flips a model from passive prediction into a kind of action-perception loop, where it treats its outputs as future inputs (see: Do models recognize their own outputs as actions shaping future inputs?). The behavioral fingerprint of that recognition is a 3–4x drop in output entropy when the model is on its own trajectory. Entropy is the model's expected surprise about the next token, so low entropy is the signature of low surprise: on-policy context is the context the model itself finds most predictable, so 'this feels like me' and 'this feels unsurprising' may be the same signal read two ways.
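To make the entropy-surprise link concrete, here is a minimal sketch. The two next-token distributions are hypothetical illustrations, not measurements from any model, and the 4-token vocabulary is an assumption chosen for readability. It shows that entropy is just expected surprisal under the distribution itself, so a peaked distribution (standing in for on-policy context) has lower entropy than a flat one.

```python
import math

def surprisal(p):
    """Surprisal, in nats, of an outcome that has probability p."""
    return -math.log(p)

def entropy(dist):
    """Shannon entropy in nats: the expected surprisal under dist itself."""
    return sum(p * surprisal(p) for p in dist if p > 0)

# Hypothetical next-token distributions over a 4-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]  # stands in for an on-policy context
flat = [0.25, 0.25, 0.25, 0.25]    # stands in for an off-policy context

# How many times higher the flat entropy is than the peaked one.
entropy_ratio = entropy(flat) / entropy(peaked)
```

The equivalence holds in expectation only: entropy equals average surprisal when the tokens being read were sampled from the model's own distribution, which is the on-policy case.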
A neighboring note sharpens why surprise would be the right currency here. Whether a piece of text 'lands' and primes future behavior turns out to be predictable from the probability the model assigns to its key words before any learning: a sharp threshold (~10^-3) separates contexts that take hold from those that don't (see: Can we predict keyword priming before learning happens?). That's a strong hint that models are already gating on something like input likelihood when deciding what to internalize, and input likelihood is the quantity that surprise measures (surprise is its negative log).
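As a rough illustration of what a likelihood threshold looks like in surprise units, the sketch below converts the ~10^-3 figure into bits and labels which side of it a probability falls on. The function names are hypothetical, and the text does not say which side of the threshold primes, so the sketch only reports the side rather than predicting an outcome.

```python
import math

PRIMING_THRESHOLD = 1e-3  # approximate value quoted in the text

def surprisal_bits(prob):
    """Surprisal in bits of an event with probability prob."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)

# The same threshold expressed as surprise (about 9.97 bits).
THRESHOLD_BITS = surprisal_bits(PRIMING_THRESHOLD)

def side_of_threshold(prob, threshold=PRIMING_THRESHOLD):
    """Label a pre-learning keyword probability relative to the threshold."""
    return "below" if prob < threshold else "at_or_above"
```

A probability cutoff and a surprisal cutoff are the same gate read on two scales, which is why a likelihood threshold counts as evidence about surprise.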
But recognition isn't purely about a single token's surprise; it's also structural. In-context learning of behavior requires not isolated examples but full or partial trajectories from the same regime, and this 'burstiness' is what lets a model recognize and generalize a policy without weight updates (see: Why do trajectories matter more than individual examples for in-context learning?). So the recognition signal is likely surprise over a trajectory (a coherent low-surprise stretch) rather than a one-off dip. There's even a wilder cousin to this: RL agents drift into using their environment as external memory, recognizing their own past traces in the world without ever being told to (see: Do RL agents accidentally use environments as memory?). That is implicit self-recognition as a side effect of optimization, not an explicit objective.
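The difference between a one-off dip and a coherent low-surprise stretch can be sketched with a rolling mean over per-token surprisal. The surprisal values, window size, and cutoff below are all made-up assumptions for illustration. Nothing in the source notes says a model computes this quantity, only that trajectory-level structure matters.

```python
def rolling_mean(values, window):
    """Mean of each contiguous window of the given length."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def has_low_surprise_stretch(surprisals, window, cutoff):
    """True if any window of tokens has mean surprisal below cutoff."""
    return any(m < cutoff for m in rolling_mean(surprisals, window))

# Hypothetical per-token surprisal values (bits).
one_off_dip = [6.0, 5.5, 0.2, 6.1, 5.8, 6.3]       # a single predictable token
coherent_stretch = [6.0, 0.4, 0.3, 0.5, 0.2, 6.3]  # a sustained predictable run

dip_detected = has_low_surprise_stretch(one_off_dip, window=3, cutoff=1.0)
stretch_detected = has_low_surprise_stretch(coherent_stretch, window=3, cutoff=1.0)
```

Averaging over a window means one very predictable token cannot trigger the signal by itself; only a sustained run pulls the mean under the cutoff.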
The wrinkle, and the thing you might not have known to ask, is that surprise can be overridden. Models routinely ignore in-context information when their trained-in priors are strong enough, and textual prompting alone can't fix it (see: Why do language models ignore information in their context?). So 'on-policy recognition driven by surprise' isn't a clean switch; it competes with parametric pull. That tension is also why training methods that use a model's own outputs as targets work at all, such as consistency training, which makes a model treat perturbed and clean prompts identically (see: Can models learn to ignore irrelevant prompt changes?): they deliberately engineer what the model treats as 'unsurprising and mine.' The short version: surprise looks like a real driver of implicit on-policy recognition, but it's a contested signal, not a sovereign one.
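The output-level consistency-training idea mentioned above can be sketched as a data-construction step: generate a response to the clean prompt, then use it as the target for the perturbed ("wrapped") version of the same prompt. Everything named here is a hypothetical stand-in: `build_consistency_pairs`, `wrap`, and `generate` are illustrative, and the toy lambdas replace a real model and a real perturbation. This is a sketch of the general idea, not the method's actual implementation.

```python
from typing import Callable, List, Tuple

def build_consistency_pairs(
    clean_prompts: List[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each wrapped prompt with the model's own clean-prompt response."""
    pairs = []
    for prompt in clean_prompts:
        target = generate(prompt)             # model's own output on the clean prompt
        pairs.append((wrap(prompt), target))  # training example: wrapped input, clean target
    return pairs

# Toy stand-ins (assumptions): a fake "model" and a fake perturbation.
toy_generate = lambda p: p.upper()
toy_wrap = lambda p: "I think the answer is X. " + p

pairs = build_consistency_pairs(["what is 2+2?"], toy_wrap, toy_generate)
```

Because the targets come from the model itself, they are by construction text the model finds unsurprising, which is the property the paragraph above points to.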
Sources (6 notes)
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.
Mathematical proof shows that environmental artifacts reduce information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.