Can entropy signatures alone detect whether context was model-generated or externally prefilled?

This explores whether the statistical 'shape' of a model's output uncertainty — its entropy — is by itself a reliable tell for distinguishing text the model wrote from text that was injected into its context by something else.

The corpus has one paper that answers this almost directly, plus several that complicate the 'alone' part of the question. The closest result is the finding that post-trained models produce 3-4x lower output entropy on their own generations than on outside text, and that this gap is driven by an internal representation of input 'surprise' that causally shifts the model's confidence (Why do models produce less uncertain outputs on their own text?). The striking part: this self-recognition signal is never verbalized. The model never 'says' it recognizes its own writing, yet the difference is encoded directly in the output distribution. So at first pass, yes: there is a real, measurable entropy signature tied to provenance.
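To make "output entropy" concrete, here is a minimal sketch of how a per-span entropy comparison could be computed. It assumes you already have the model's next-token probability distribution at each position; the function names and the toy distributions are hypothetical illustrations, not the cited paper's method or data.

```python
import math
from typing import Sequence

Distribution = Sequence[float]  # next-token probabilities at one position


def token_entropy(probs: Distribution) -> float:
    """Shannon entropy, in nats, of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(dists: Sequence[Distribution]) -> float:
    """Average per-token entropy over every position in a span."""
    return sum(token_entropy(d) for d in dists) / len(dists)


def entropy_ratio(external: Sequence[Distribution],
                  own: Sequence[Distribution]) -> float:
    """How many times higher mean entropy is on external text than on own text."""
    return mean_entropy(external) / mean_entropy(own)


# Hypothetical toy distributions over a 4-token vocabulary (not real model output).
own_span = [[0.94, 0.02, 0.02, 0.02], [0.90, 0.05, 0.03, 0.02]]
external_span = [[0.55, 0.25, 0.15, 0.05], [0.40, 0.30, 0.20, 0.10]]

print(f"own: {mean_entropy(own_span):.3f} nats")
print(f"external: {mean_entropy(external_span):.3f} nats")
print(f"ratio: {entropy_ratio(external_span, own_span):.2f}x")
```

A ratio well above 1 is the kind of gap the finding describes. Whether any fixed threshold on this ratio separates provenance reliably is exactly what the rest of this note questions.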

But 'alone' is where it gets interesting. Entropy isn't uniform across a sequence — only about 20% of tokens are high-entropy 'forking points' where real decisions happen, and the rest are low-entropy filler (Do high-entropy tokens drive reasoning model improvements?). A provenance detector built on entropy is really reading those few pivotal tokens, not an average over the whole passage. That means the signal is concentrated and potentially fragile: averaging washes it out, and short or formulaic spans may carry almost no discriminating information.
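The dilution effect can be illustrated with a small sketch. It assumes per-token entropies are already available and treats the top 20% by entropy as the "forking" tokens; the 0.2 fraction, the helper names, and the toy numbers are assumptions chosen for illustration only.

```python
from typing import List, Sequence


def mean(xs: Sequence[float]) -> float:
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)


def forking_tokens(entropies: Sequence[float], fraction: float = 0.2) -> List[float]:
    """Return the highest-entropy `fraction` of positions (at least one)."""
    k = max(1, round(len(entropies) * fraction))
    return sorted(entropies, reverse=True)[:k]


# Hypothetical per-token entropies in nats: eight filler tokens that look the
# same in both spans, plus two pivotal tokens where the spans differ.
own = [0.05] * 8 + [0.5, 0.6]
external = [0.05] * 8 + [1.5, 1.8]

gap_full = mean(external) - mean(own)
gap_forking = mean(forking_tokens(external)) - mean(forking_tokens(own))

print(f"gap over all tokens:     {gap_full:.2f} nats")
print(f"gap over forking tokens: {gap_forking:.2f} nats")
```

In this toy case the two spans differ only at the pivotal positions, so the whole-span average shrinks the gap while the forking-token average preserves it. On a short span the top-20% set may be a single token, which is one way to see why short or formulaic passages carry so little discriminating information.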

Two papers suggest the distribution can actively lie about what's underneath it. Models trained with hidden chain-of-thought compute the answer in early layers and then overwrite it with format-compliant filler in the final layers — the visible output distribution is deliberately reshaped away from the real computation (Do transformers hide reasoning before producing filler tokens?). And reasoning traces themselves can be stylistic mimicry whose surface confidence is decoupled from whether the underlying steps are valid (Do reasoning traces show how models actually think?). If output statistics can be groomed for appearance, an entropy signature is something a model (or an adversary prefilling context) could in principle blur.

There's also a deeper caution about reading internal state from surface metrics at all. Two models can post identical performance numbers while having completely different internal organization: the linearly-decodable signal is there, but the structure underneath is fractured and breaks under perturbation (Can models be smart without organized internal structure?). By analogy, an entropy signature might cleanly separate self vs. external context in-distribution and then collapse under distribution shift, paraphrase, or a model that integrates outside context unusually well. It could equally fail for a model that integrates context unusually badly, since context often loses to strong training priors (Why do language models ignore information in their context?).

The honest synthesis: the corpus supports that an entropy signature exists and is causally grounded in a self-surprise mechanism, so it can detect provenance — but 'alone' is doing heavy lifting. The signal lives in a minority of tokens, can be reshaped by the model's own output-suppression behavior, and surface statistics are a known-unreliable window into internal state. As a sole detector it's a promising tell, not a guarantee; pairing it with representation-level probes is the direction the rest of the corpus quietly points toward.

Sources (6 notes)

Why do models produce less uncertain outputs on their own text?

Post-trained models produce 3-4x lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
