What is the behavioral signature of a model tracking input surprise?

This explores the observable behaviors that reveal whether a model is registering how unexpected its input is. The corpus suggests surprise shows up in two opposite ways: output entropy that collapses when inputs are familiar, and processing that bloats when inputs are anomalous.

The corpus doesn't have a paper titled 'surprise tracking,' but it does have several findings that, read together, sketch what that signature looks like, and they point in two directions at once.

The clearest tell is entropy. When a post-trained model is fed its own prior outputs (inputs it 'expected' because it generated the trajectory), its output entropy is 3-4x lower than on off-policy text it didn't produce (see: Do models recognize their own outputs as actions shaping future inputs?). That confidence gap is itself a surprise meter: low entropy on familiar trajectories, higher entropy on unfamiliar ones. The model behaves as if it knows the difference between input it caused and input it merely received. A related convergence effect shows up in training, where RL collapses onto a single dominant format within the first epoch and suppresses the alternatives (see: Does RL training collapse format diversity in pretrained models?). That narrowing is a behavioral signature that the model has locked onto an 'expected' shape and treats deviations as low-probability.
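The entropy comparison can be made concrete with a small sketch. It assumes you can already extract the model's next-token probability distribution at each position for both an on-policy and an off-policy text; the function names (`token_entropy`, `mean_entropy`, `entropy_ratio`) and the toy distributions are hypothetical and not taken from the cited notes.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in nats, of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average per-position entropy over a sequence of distributions."""
    if not distributions:
        raise ValueError("need at least one distribution")
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def entropy_ratio(off_policy: Sequence[Sequence[float]],
                  on_policy: Sequence[Sequence[float]]) -> float:
    """How many times higher mean entropy is on off-policy text."""
    on = mean_entropy(on_policy)
    if on == 0.0:
        raise ValueError("on-policy entropy is zero; ratio is undefined")
    return mean_entropy(off_policy) / on


# Toy data (assumed, for illustration only): sharply peaked distributions
# stand in for familiar input, uniform ones for unfamiliar input.
toy_on_policy = [[0.97, 0.01, 0.01, 0.01]] * 3
toy_off_policy = [[0.25, 0.25, 0.25, 0.25]] * 3

print(f"off-policy / on-policy entropy ratio: "
      f"{entropy_ratio(toy_off_policy, toy_on_policy):.2f}")
```

A ratio well above 1 is what the finding above would predict; the 3-4x figure is the reported result, not something this toy data reproduces.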

The inverse signature is what happens under genuine surprise, and here the behavior is failure, not graceful adjustment. Append a semantically irrelevant sentence to a math problem and reasoning models don't ignore it; error rates jump roughly 300% and responses get *longer* (see: How vulnerable are reasoning models to irrelevant text?). That length inflation is the most concrete behavioral fingerprint in the corpus: a model thrown by unexpected input churns through more tokens, as though the surprise consumes processing it can't resolve. Surprise registers as visible thrashing.
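A minimal sketch of how one might measure both fingerprints on paired clean and perturbed runs follows. The `Run` record, the function names, and the toy numbers are all hypothetical. It also assumes that "jump roughly 300%" means a relative increase, which would put the perturbed error rate at about 4x the baseline; the note does not spell this out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass
class Run:
    """One model response to one problem (hypothetical record layout)."""
    correct: bool
    num_tokens: int


def error_rate(runs: Sequence[Run]) -> float:
    """Fraction of runs whose answer was wrong."""
    return sum(not r.correct for r in runs) / len(runs)


def relative_increase(baseline: float, perturbed: float) -> float:
    """(perturbed - baseline) / baseline; 3.0 means a 300% increase."""
    if baseline == 0:
        raise ValueError("baseline is zero; relative increase is undefined")
    return (perturbed - baseline) / baseline


def surprise_report(clean: Sequence[Run],
                    perturbed: Sequence[Run]) -> Dict[str, float]:
    """Compare error rate and mean response length across conditions."""
    return {
        "error_rate_increase": relative_increase(
            error_rate(clean), error_rate(perturbed)),
        "length_inflation": relative_increase(
            mean([r.num_tokens for r in clean]),
            mean([r.num_tokens for r in perturbed])),
    }


# Toy data (assumed): 1 of 10 wrong when clean, 4 of 10 wrong with an
# irrelevant sentence appended, and longer responses in the second case.
clean_runs = [Run(correct=(i != 0), num_tokens=100) for i in range(10)]
perturbed_runs = [Run(correct=(i >= 4), num_tokens=150) for i in range(10)]

print(surprise_report(clean_runs, perturbed_runs))
```

Reporting the two numbers side by side matters because the length signal is visible even on problems the model still gets right.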

There's also an *internal* version of the signature, and it's the most counterintuitive thread. DPO training builds a two-stage detection circuit that flags injected steering vectors: the model can internally register 'this activation is anomalous' with near-perfect accuracy (see: How do language models detect injected steering vectors internally?). So a surprise-tracking signal demonstrably exists inside the weights. The twist: safety training actively suppresses it, dropping detection from 63.8% to 10.8%. The capacity to notice anomalous input is real, but it can be trained out.

The less obvious point: the behavioral signature of tracking surprise and the behavioral signature of *reporting* it come apart. Models can hold an internal anomaly signal while their outward behavior stays smooth and confident, and reflection mechanisms that should catch the mismatch rarely fire, mostly confirming the first answer rather than revising under surprise (see: Can we actually trust reasoning model outputs?). So if you're hunting for surprise in a model's behavior, watch the entropy and the token count, not the model's own commentary; the commentary is the channel most likely to have been tuned to stay quiet.

Sources (5 notes)

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

How vulnerable are reasoning models to irrelevant text?

Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.

How do language models detect injected steering vectors internally?

Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher re-testing claims about how models register input surprise. The question remains: what observable behaviors reveal whether a model is tracking how unexpected its input is?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026. A curated library identified:
- Output entropy drops 3–4x when models process their own prior outputs vs. off-policy text, suggesting models track familiarity (~2026).
- RL post-training collapses onto a single dominant format in epoch 1, narrowing behavior as a signature of 'expected' shape recognition (~2025).
- Semantically irrelevant sentence injections cause ~300% error rate increase in reasoning models, with token inflation as the concrete fingerprint of surprise-induced thrashing (~2025).
- DPO training builds a two-stage internal detection circuit that flags anomalous activations; safety training reduces detection from 63.8% to 10.8% (~2026).
- Models maintain internal anomaly signals while outward behavior stays smooth; reflection rarely catches the mismatch, defaulting to first answers (~2026).

Anchor papers (verify; mind their dates):
- 2025-03: arXiv:2503.01781 (adversarial triggers + error inflation)
- 2025-04: arXiv:2504.07912 (RL convergence narrowing)
- 2026-03: arXiv:2603.21396 (introspective awareness circuits)
- 2026-05: arXiv:2605.25459 (enaction & own-output recognition)

Your task:
(1) RE-TEST EACH CONSTRAINT: For entropy drops, RL narrowing, token inflation under surprise, and internal anomaly detection — check whether scale, architectural changes (attention patterns, MoE routing), training harnesses (DPO variants, constitutional AI), or newer evals have since relaxed these signatures or shifted the regime. Separate the durable claim (models can internally track surprise) from perishable limits (specific detection thresholds, safety training suppression rates). Cite what changed it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially anything showing reflection *does* catch mismatches, or entropy signatures that don't scale, or safety approaches that preserve anomaly detection.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Can multi-agent or mixture-of-experts orchestration route surprise signals in ways that bypass safety suppression? (b) Do newer memorization-mitigation techniques accidentally restore or degrade surprise tracking?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
