How does content-only knowledge in LLMs enable pretraining popularity to leak through?
This explores how facts an LLM picks up passively across pretraining, encoded as content without being grounded in use, let raw frequency signals (how often something appeared) bleed into what the model produces.
'Content-only' knowledge here means things a model absorbed because they were *present and repeated* in pretraining, not because it learned to ground or apply them. That kind of knowledge becomes the channel through which the popularity of the training data itself leaks into outputs. The corpus doesn't tackle 'popularity leakage' under that exact name, but several notes circle the mechanism from different sides, and read together they sketch it clearly.
The foundational move is the split between what a model *encodes* and what it *uses*. Multiple studies find that LLMs can store facts in their representations while those facts fail to causally drive generation (see: Do language models actually use their encoded knowledge?). When encoding and usage come apart like this, what's left steering the output isn't grounded reasoning; it's the statistical residue of the training distribution. The more often something appeared, the more it dominates that residue. So 'content-only' knowledge is exactly the kind of knowledge whose strength is set by frequency rather than by whether the model actually understands or can apply it.
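To make the frequency point concrete, here is a minimal sketch of a purely count-based completer. It is not how an LLM works internally; it only illustrates how a system with no grounding ends up answering with whatever was most common. The corpus, the country "Freedonia", both city names, and the function `most_frequent_completion` are all invented for illustration, and the assumption that the rarer claim is the correct one is part of the setup.

```python
from collections import Counter

# Hypothetical toy corpus: each string stands in for one pretraining document.
# Freedonia and both city names are invented. Assume, for the sake of the
# example, that the rarer claim ("northgate") is the correct one.
corpus = (
    ["the capital of freedonia is eastport"] * 8
    + ["the capital of freedonia is northgate"] * 2
)


def most_frequent_completion(prompt, documents):
    """Return the most common continuation of `prompt` and its share of matches."""
    continuations = Counter(
        doc[len(prompt):].strip() for doc in documents if doc.startswith(prompt)
    )
    completion, count = continuations.most_common(1)[0]
    return completion, count / sum(continuations.values())


completion, share = most_frequent_completion("the capital of freedonia is", corpus)
print(completion, share)
```

Under these assumptions the completer returns the popular claim, and its confidence (the share) measures only repetition in the corpus, not correctness.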
That frequency signal turns out to be surprisingly active. LLMs perform out-of-context reasoning across the *whole* training distribution, stitching together implicit hints scattered across many documents to reconstruct facts never stated in any single one (see: Can LLMs reconstruct censored knowledge from scattered training hints?). This is popularity leaking through by aggregation: the model isn't recalling a source, it's integrating how often and how widely something co-occurred. The same property is what lets LLMs convincingly *simulate* search engines purely from internal knowledge: the 'results' they generate are a readout of what the training corpus emphasized (see: Can LLMs replace search engines during agent training?).
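The aggregation step can be sketched with a toy version of the scattered-distance setup. This is an illustration under stated assumptions, not the experiments' actual method: the landmark and candidate coordinates are hypothetical planar points, each "document" states only one distance from an unnamed place to one landmark, and the helper `consistent_candidates` is invented here.

```python
import math

# Hypothetical planar coordinates (arbitrary units); none of this is real data.
landmarks = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (0.0, 10.0)}
candidates = {"P": (3.0, 4.0), "R": (-3.0, 4.0), "S": (3.0, -4.0)}

# Each "document" mentions one distance from the unnamed place to one landmark.
documents = [("A", 5.0), ("B", 8.062), ("C", 6.708)]


def consistent_candidates(docs, tolerance=0.01):
    """Return candidates whose distances match every hint in `docs`."""
    matches = []
    for name, point in candidates.items():
        if all(
            abs(math.dist(point, landmarks[landmark]) - distance) <= tolerance
            for landmark, distance in docs
        ):
            matches.append(name)
    return sorted(matches)


# Any single document leaves the place ambiguous...
per_document = [consistent_candidates([doc]) for doc in documents]
# ...but the hints taken together single out one candidate.
combined = consistent_candidates(documents)
print(per_document, combined)
```

No single hint identifies the place, yet the combination does, which is the sense in which the answer is a product of the whole distribution rather than of any one source.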
The corpus's notes on failure modes show why this matters rather than being a curiosity. Potemkin understanding, meaning fluent correct explanation paired with failed application, reveals explanation and execution running on functionally disconnected pathways (see: Can LLMs understand concepts they cannot apply?). The explanation pathway is the content-only one: it can recite the popular framing of a concept while the model can't act on it. That gap between pattern-tracking and actual competence is the structural home of leakage (see: How do LLMs fail to know what they seem to understand?). And models are poor at noticing it themselves: their self-reports are surface-level and shift under pressure, so they can't flag when an answer is riding frequency rather than knowledge (see: How well do language models understand their own knowledge?).
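One simple way to put a number on the explanation/application gap is a conditional rate: among concepts the model explained correctly, what fraction did it then fail to apply? The sketch below assumes hand-labelled evaluation records; the record fields, the sample data, and the function `explanation_application_gap` are hypothetical, and this is not necessarily the metric used in the underlying studies.

```python
def explanation_application_gap(results):
    """Fraction of correctly explained concepts that were then applied incorrectly.

    Returns None when no concept was explained correctly, since the rate is
    undefined in that case.
    """
    explained = [r for r in results if r["explained"]]
    if not explained:
        return None
    failed = sum(1 for r in explained if not r["applied"])
    return failed / len(explained)


# Hypothetical labelled results; concept names are placeholders.
results = [
    {"concept": "c1", "explained": True, "applied": True},
    {"concept": "c2", "explained": True, "applied": False},
    {"concept": "c3", "explained": True, "applied": False},
    {"concept": "c4", "explained": True, "applied": True},
    {"concept": "c5", "explained": False, "applied": False},
]

gap = explanation_application_gap(results)
print(gap)
```

A rate near zero would suggest explanation and application move together; a high rate is the pattern described above, where the recited framing is intact but does not drive behavior.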
The through-line the corpus leaves you with: 'hallucination' frames the problem as the model inventing things, but a quieter failure is the model faithfully reproducing *what was common* and presenting that as what's *true* or *applicable*. Popularity leakage isn't a bug in retrieval — it's the default behavior of a system whose knowledge lives as undigested content, where 'how often' silently substitutes for 'how right.'
Sources (6 notes)
Do language models actually use their encoded knowledge? Multiple studies confirm that language models can encode facts in their representations while those facts fail to causally affect downstream outputs. Encoding and usage are distinct processes.
Can LLMs reconstruct censored knowledge from scattered training hints? Language models perform out-of-context reasoning across the full training distribution, reconstructing information never explicitly stated in any single document. Experiments show models can infer city identities from scattered distance relationships and apply them downstream without in-context learning.
Can LLMs replace search engines during agent training? ZeroSearch and SSRL demonstrate that LLMs can generate relevant documents and search results from internal knowledge, with 14B simulators matching or exceeding real search engines. Curriculum degradation and test-time scaling optimize this approach for training without API costs.
Can LLMs understand concepts they cannot apply? Models can explain concepts accurately, fail to apply them, and recognize the failure, a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
How do LLMs fail to know what they seem to understand? LLMs show repeatable, empirically documented failure modes, from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.
How well do language models understand their own knowledge? LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.