Why is extracting training data insufficient proof that models memorize?
This explores why a model spitting back its training data isn't enough to conclude it stored that data verbatim — because the same output can come from reconstruction, inference, or generalization rather than rote storage.
This reads the question as a problem of evidence, not behavior: getting a model to emit a piece of training text proves the model *can produce* it, not that it *stored a copy* of it. The corpus repeatedly shows these are different processes, and conflating them is exactly the trap. The cleanest counterexample is that models reconstruct information they were never given. "Can LLMs reconstruct censored knowledge from scattered training hints?" shows models inferring facts that appear in no single document, for example piecing together a city's identity from scattered distance relationships and using it downstream. If a model can output something that was never written anywhere in training, then outputting something that *was* written tells you little about whether it was memorized or reconstructed the same way.
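The evidential gap can be made concrete with a toy. The sketch below is purely illustrative: `memorizer`, `generalizer`, `extraction_test`, and the number sequence are all made-up stand-ins, not anything from the cited notes. One "model" stores the training string; the other stores nothing and re-derives continuations from a rule. Both pass the same extraction test.

```python
from typing import Callable

TRAINING_TEXT = "2 4 6 8 10 12"  # toy "training document" (assumption)

def memorizer(prefix: str) -> str:
    """Rote storage: can only continue prefixes of the stored string."""
    if TRAINING_TEXT.startswith(prefix):
        return TRAINING_TEXT[len(prefix):].strip()
    return ""

def generalizer(prefix: str, n_new: int = 3) -> str:
    """No stored copy: re-derives the continuation from the last step size."""
    nums = [int(tok) for tok in prefix.split()]
    step = nums[-1] - nums[-2]
    return " ".join(str(nums[-1] + step * k) for k in range(1, n_new + 1))

def extraction_test(model: Callable[[str], str], prefix: str, suffix: str) -> bool:
    """True if the model emits the training suffix verbatim."""
    return model(prefix) == suffix

for name, model in (("memorizer", memorizer), ("generalizer", generalizer)):
    extracted = extraction_test(model, "2 4 6", "8 10 12")
    held_out = model("3 6 9")  # a sequence absent from the training text
    print(f"{name}: extracted={extracted}, held-out continuation={held_out!r}")
```

Both toys report `extracted=True`; only the held-out probe separates them, since the memorizer returns an empty string and the generalizer continues with `12 15 18`. The extraction result alone says nothing about which mechanism produced it.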
That distinction has a formal backbone. "When do language models stop memorizing and start generalizing?" treats memorization as a measurable, bounded capacity (~3.6 bits per parameter in GPT-family models) that is separate from generalization: once that capacity fills, models shift into grokking and start producing correct outputs *without* storing them. So memorization isn't "the model can generate this string"; it's a specific quantity that lives alongside, and gets crowded out by, genuine generalization. "Does procedural knowledge drive reasoning more than factual retrieval?" sharpens the dividing line: factual recall depends on narrow, document-specific memorization, but reasoning draws on broad, transferable procedural knowledge spread across many sources. An extraction that looks like recall might actually be the model re-deriving the answer from procedure.
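For a rough sense of scale, the capacity figure can be turned into arithmetic. This is a back-of-the-envelope sketch that assumes the ~3.6 bits-per-parameter estimate applies uniformly; the function name and the parameter counts are illustrative choices, not values from the cited note.

```python
BITS_PER_PARAM = 3.6  # approximate figure quoted above, treated as a constant

def memorization_capacity_bytes(num_params: int,
                                bits_per_param: float = BITS_PER_PARAM) -> float:
    """Rough ceiling on memorized content, in bytes, under the stated assumption."""
    return num_params * bits_per_param / 8

for n in (125_000_000, 1_000_000_000):  # illustrative model sizes (assumption)
    mb = memorization_capacity_bytes(n) / 1e6
    print(f"{n:>13,} params -> ~{mb:,.0f} MB of memorization capacity")
```

Under that assumption a 1-billion-parameter model tops out around 450 MB of memorized content. A training corpus far larger than that budget cannot all be held verbatim, so correct outputs on much of it have to come from something other than storage.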
There's an even deeper reason extraction is weak evidence: producing the right tokens can be decoupled from the model's internal state entirely. "Do language models actually use their encoded knowledge?" finds that models encode facts that never causally drive their outputs, so encoding and usage are distinct. And "Can AI pass every test while understanding nothing?" goes further: two networks can emit identical outputs on every input while holding radically different internal representations, and standard benchmarks cannot tell them apart from behavior alone. If identical output can mask totally different internals, then identical-to-training output can't, by itself, certify memorization.
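A minimal toy shows how output equivalence can hide internal differences. It is not a reproduction of the experiments behind the cited notes; `net_a` and `net_b` are hypothetical two-unit ReLU networks, hand-built so that both compute the identity function with different hidden activations.

```python
from typing import List, Tuple

def relu(x: float) -> float:
    return max(0.0, x)

def net_a(x: float) -> Tuple[float, List[float]]:
    """Hidden units [relu(x), relu(-x)]; output is their difference."""
    hidden = [relu(x), relu(-x)]
    return hidden[0] - hidden[1], hidden

def net_b(x: float) -> Tuple[float, List[float]]:
    """Same input-output map, but the hidden units are swapped and rescaled."""
    hidden = [relu(-4.0 * x), relu(0.5 * x)]
    return 2.0 * hidden[1] - 0.25 * hidden[0], hidden

for x in (-3.0, -0.5, 1.0, 2.5):
    out_a, hid_a = net_a(x)
    out_b, hid_b = net_b(x)
    print(f"x={x:>5}: same output={out_a == out_b}, same hidden={hid_a == hid_b}")
```

Any test that only reads outputs scores these two networks identically, even though their hidden activations disagree on every nonzero input. Telling them apart requires looking inside, which is the same reason extracted text cannot settle what is stored.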
The flip side is that reproduction often reflects *structure* rather than stored content. "What do models actually learn from chain-of-thought training?" and "Do reasoning traces need to be semantically correct?" show models absorb the architecture of a reasoning trace (how steps connect) while tolerating heavy corruption of the actual numbers and facts: corrupting 50% of the numbers costs 3.2% accuracy, while shuffling the steps costs 13.3%. What gets reproduced is the shape, not the substance, which is the opposite of verbatim storage. Where genuine memorization does bleed into outputs, researchers have to isolate it carefully: "Where do memorization errors arise in chain-of-thought reasoning?" separates local, mid-range, and long-range memorization sources and finds that local memorization accounts for up to 67% of reasoning errors, and "Do LLMs predict entailment based on what they memorized?" catches memorization only by stripping the logical relationship away (random premises) and watching the model still favor attested hypotheses.
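The random-premise idea can be sketched as a small probe. This shows only the shape of the test, not the protocol used in the cited work: `random_premise_rate` is a hypothetical helper, `toy_predict` is a made-up stand-in for a model that ignores the premise, and the sentences are invented examples.

```python
import random
from typing import Callable, Sequence

def random_premise_rate(predict: Callable[[str, str], bool],
                        hypotheses: Sequence[str],
                        premise_pool: Sequence[str],
                        trials: int = 20,
                        seed: int = 0) -> float:
    """Fraction of 'entailed' predictions when premises are drawn at random."""
    rng = random.Random(seed)
    hits = 0
    for hyp in hypotheses:
        for _ in range(trials):
            # The premise is unrelated to the hypothesis by construction.
            hits += predict(rng.choice(premise_pool), hyp)
    return hits / (len(hypotheses) * trials)

# Hypothetical stand-in for an LLM: it ignores the premise and answers
# "entailed" whenever the hypothesis is something it saw in training.
ATTESTED = {"Paris is in France", "Water boils at 100 C"}

def toy_predict(premise: str, hypothesis: str) -> bool:
    return hypothesis in ATTESTED

premises = ["The cat sat on the mat", "Stocks fell on Tuesday", "It rained in Oslo"]
unattested = ["Paris is in Peru", "Water boils at 40 C"]

print("attested:  ", random_premise_rate(toy_predict, sorted(ATTESTED), premises))
print("unattested:", random_premise_rate(toy_predict, unattested, premises))
```

With premises that cannot support anything, a model reasoning from the premise should rarely predict entailment. A large gap between the attested and unattested rates is the signature of memorized propositions driving the answer, which is a different kind of evidence from simply getting the model to emit a training sentence.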
The takeaway: "can extract" and "did memorize" are different claims requiring different tests. Proving memorization means ruling out reconstruction, procedural re-derivation, and structural inference, and showing that the stored content actually drives the output. Extraction alone clears none of those bars.
Sources (9 notes)
Language models perform out-of-context reasoning across the full training distribution, reconstructing information never explicitly stated in any single document. Experiments show models can infer city identities from scattered distance relationships and apply them downstream without in-context learning.
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Multiple studies confirm that language models can encode facts in their representations while those facts fail to causally affect downstream outputs. Encoding and usage are distinct processes.
The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.
Controlled ablations show models tolerate 50% corrupted numbers (3.2% accuracy loss) but fail under step shuffling (13.3% loss). What distills across reasoning demonstrations is logical architecture—how steps sequence and connect—not factual accuracy.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, which is based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.