Where does a model store memorized paragraphs?
Can we pinpoint the specific layers, attention heads, and tokens where language models localize verbatim memorization? Understanding this spatial signature could enable targeted unlearning.
Can we localize where a model stores the verbatim paragraphs it can recite? This study (GPT-Neo 125M, trained on the Pile) finds that memorization is spread across layers and components but still has a distinguishable spatial signature. Gradients for memorized paragraphs are larger in lower layers than gradients for non-memorized examples, so memorized examples can be unlearned by finetuning only the high-gradient weights. One low-layer attention head is especially involved, and it predominantly attends to distinctive, rare tokens (those least frequent in the corpus unigram distribution). Token-perturbation analysis shows that memorization is concentrated in a few distinctive tokens early in the prefix: corrupting them often corrupts the entire continuation. Memorized continuations are also harder to unlearn and to corrupt than non-memorized ones.
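To make the "finetune only the high-gradient weights" idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the study's procedure: the function names, the toy weights, and the use of a gradient-ascent step on the memorized example's loss as the unlearning update are all hypothetical choices made for this example.

```python
import math
from typing import Dict, List

Params = Dict[str, List[float]]

def layer_gradient_norms(grads: Params) -> Dict[str, float]:
    """L2 norm of the gradient per named layer, for comparing layers."""
    return {name: math.sqrt(sum(g * g for g in values)) for name, values in grads.items()}

def top_gradient_mask(grads: Params, keep_fraction: float) -> Dict[str, List[bool]]:
    """Mark weights whose absolute gradient is among the largest overall.

    Ties at the threshold are all kept, so slightly more than
    keep_fraction of the weights may be selected.
    """
    magnitudes = sorted((abs(g) for values in grads.values() for g in values), reverse=True)
    if not magnitudes:
        return {name: [] for name in grads}
    k = max(1, int(len(magnitudes) * keep_fraction))
    threshold = magnitudes[k - 1]
    return {name: [abs(g) >= threshold for g in values] for name, values in grads.items()}

def masked_ascent_step(weights: Params, grads: Params,
                       mask: Dict[str, List[bool]], lr: float) -> Params:
    """Assumed unlearning update: raise the memorized example's loss by
    stepping along its gradient, but only on the masked weights."""
    return {
        name: [w + lr * g if m else w
               for w, g, m in zip(weights[name], grads[name], mask[name])]
        for name in weights
    }

# Toy values (hypothetical): gradients of the loss on one memorized paragraph.
weights = {"layer0": [0.5, -0.2], "layer5": [0.1, 0.3]}
grads = {"layer0": [0.9, -0.05], "layer5": [0.02, 0.4]}

mask = top_gradient_mask(grads, keep_fraction=0.5)
updated = masked_ascent_step(weights, grads, mask, lr=0.1)
```

In a real setting the gradients would come from backpropagating through the model on memorized versus non-memorized paragraphs, and the update would likely include a term that preserves behaviour on non-memorized text. The sketch only shows the selection step: most weights stay frozen and the update touches the small set where the memorized example's gradient is largest.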
The keeper is the localization signature: memorization, though distributed, leaves a low-layer / rare-token / early-prefix fingerprint that makes it targetable for unlearning — and rare tokens are the hook the model hangs verbatim recall on.
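The rare-token and early-prefix parts of that fingerprint suggest a simple probe: find the prefix tokens that are rarest under the corpus unigram distribution and corrupt only those. The sketch below shows that selection step in plain Python. The function names, the `<unk>` replacement token, and the toy corpus are assumptions for illustration; the note does not say how the study chose its replacement tokens.

```python
from collections import Counter
from typing import List, Sequence

def rarest_positions(prefix: Sequence[str], unigram_counts: Counter, k: int) -> List[int]:
    """Indices of the k prefix tokens with the lowest corpus frequency.

    Ties are broken in favour of earlier positions. Tokens never seen
    in the corpus count as frequency 0 (Counter returns 0 for them).
    """
    order = sorted(range(len(prefix)), key=lambda i: (unigram_counts[prefix[i]], i))
    return sorted(order[:k])

def perturb(prefix: Sequence[str], positions: Sequence[int],
            replacement: str = "<unk>") -> List[str]:
    """Replace the tokens at the given positions, leaving the rest intact."""
    chosen = set(positions)
    return [replacement if i in chosen else tok for i, tok in enumerate(prefix)]

# Toy corpus and prefix (hypothetical data).
corpus = "the cat sat on the mat the dog sat".split()
counts = Counter(corpus)
prefix = ["the", "zyzzyva", "sat", "on", "the", "mat"]

positions = rarest_positions(prefix, counts, k=2)
corrupted = perturb(prefix, positions)
```

To turn this into the perturbation analysis described above, one would feed both `prefix` and `corrupted` to the model and compare how much of the memorized continuation survives. That comparison needs a model and is left out here.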
This deepens the vault's memorization thread mechanistically. It complements the capacity account in "When do language models stop memorizing and start generalizing?" and the fine-tuning-leakage measurement in "Does repeated sensitive data in fine-tuning cause memorization?" by saying where the memorized content lives and how to target it.
Inquiring lines that use this note as a source (13)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What makes knowledge seeding equivalent to hippocampal replay in the brain?
- Why does in-weight memorization fail compared to tool-based fact access?
- Why does attending to own latents work better than bolted-on external memory stores?
- How does in-weight memorization scale with model parameter count?
- Why does semantic deduplication reduce memorization in fine-tuned models?
- What is the theoretical capacity limit before memorization saturates?
- Can contamination-free evaluation distinguish between memorization and genuine prediction ability?
- How does disentangled attention separate text from spatial reasoning?
- Can we unlearn memorized text by finetuning only high-gradient weights?
- What makes memorized paragraphs harder to corrupt than generic text?
- Why are rare tokens the hooks for verbatim model memorization?
- Can document repetition accidentally memorize sensitive information instead of learning?
- What makes factual memorization less efficient than tool-based retrieval?
Related concepts in this collection (3)
- When do language models stop memorizing and start generalizing?
  Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.
  capacity account; this adds the spatial localization signature
- Does repeated sensitive data in fine-tuning cause memorization?
  When language models train on the same private or proprietary data multiple times, how much do they end up memorizing and leaking that information at inference time? Understanding this risk is critical for organizations fine-tuning on confidential datasets.
  the leakage this localization could target for unlearning
- Do hidden massive activations act as attention bias terms?
  Explores whether a tiny handful of unusually large activations in LLMs function as structural bias terms that shape attention patterns, regardless of input content.
  both find specific low-level components doing outsized structural work
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Localizing Paragraph Memorization in Language Models
- Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
- How new data permeates LLM knowledge and how to dilute it
- How much do language models memorize?
- Spurious Forgetting in Continual Learning of Language Models
- Emergent Introspective Awareness in Large Language Models
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Titans: Learning to Memorize at Test Time
Original note title
paragraph memorization localizes to low-layer gradients and a rare-token attention head and a few prefix tokens can corrupt the whole continuation