Where does a model store memorized paragraphs?

Can we pinpoint the specific layers, attention heads, and tokens where verbatim memorization is localized in a language model? Understanding this spatial signature could enable targeted unlearning.

Synthesis note · 2026-06-03 · sourced from Memory

Can we localize where a model stores the verbatim paragraphs it can recite? This study (on GPT-Neo 125M and the Pile) finds that memorization is spread across layers and components but still has a distinguishable spatial signature. Gradients computed on memorized paragraphs are larger in the lower layers than gradients computed on non-memorized examples, so memorized examples can be unlearned by fine-tuning only the high-gradient weights. One low-layer attention head is especially involved, and it predominantly attends to distinctive, rare tokens (those least frequent in the corpus unigram distribution). Token-perturbation analysis shows that memorization is concentrated in a few distinctive tokens early in the prefix: corrupting them often corrupts the entire continuation. Memorized continuations are also harder to unlearn and to corrupt than non-memorized ones.
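As a rough illustration of "fine-tuning only the high-gradient weights", the sketch below selects the k weights with the largest absolute gradient on a memorized paragraph and updates only those. It is a toy under stated assumptions, not the study's procedure: the parameter layout, the function names `top_k_gradient_mask` and `masked_ascent_step`, the use of gradient ascent on the memorized-paragraph loss as the unlearning direction, and all numbers are invented for illustration.

```python
from typing import Dict, List

# Hypothetical layout: layer name -> flat list of weights (or gradients).
Params = Dict[str, List[float]]


def top_k_gradient_mask(grads: Params, k: int) -> Dict[str, List[bool]]:
    """Mark the k weights with the largest absolute gradient.

    Ties at the threshold are all kept, so slightly more than k
    entries can be selected.
    """
    magnitudes = sorted(
        (abs(g) for layer in grads.values() for g in layer), reverse=True
    )
    threshold = magnitudes[k - 1]
    return {
        name: [abs(g) >= threshold for g in layer]
        for name, layer in grads.items()
    }


def masked_ascent_step(
    weights: Params, grads: Params, mask: Dict[str, List[bool]], lr: float
) -> Params:
    """One gradient-ascent step on the memorized-paragraph loss,
    applied only where the mask is True (assumed unlearning direction)."""
    return {
        name: [
            w + lr * g if m else w
            for w, g, m in zip(weights[name], grads[name], mask[name])
        ]
        for name in weights
    }


# Toy numbers, invented: gradients are larger in the lower layer.
weights = {"layer0": [0.5, -0.2, 0.1], "layer5": [0.3, 0.4, -0.1]}
grads = {"layer0": [0.9, -0.05, 0.7], "layer5": [0.02, -0.01, 0.03]}

mask = top_k_gradient_mask(grads, k=2)
updated = masked_ascent_step(weights, grads, mask, lr=0.1)
print(mask)
print(updated)
```

In this toy, both selected weights fall in the lower layer and the upper layer is left untouched, which mirrors the low-layer gradient signature described above. In a real model the gradients would come from backpropagation on the memorized paragraph.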

The keeper is the localization signature: memorization, though distributed, leaves a low-layer, rare-token, early-prefix fingerprint that makes it targetable for unlearning, and rare tokens are the hook the model hangs verbatim recall on.
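The rare-token part of this fingerprint can be sketched in a few lines: rank the prefix tokens by corpus unigram count, pick the rarest ones, and replace them to build a perturbed prefix. This is a minimal sketch, not the study's code. The helper names `rarest_positions` and `perturb`, the `<unk>` replacement, whitespace tokenization, and the toy corpus are all assumptions. Comparing the model's continuations for the original and perturbed prefixes is the step left out here.

```python
from collections import Counter
from typing import List


def rarest_positions(prefix: List[str], unigram: Counter, k: int) -> List[int]:
    """Indices of the k prefix tokens with the lowest corpus count.

    Ties are broken in favour of earlier positions.
    """
    order = sorted(range(len(prefix)), key=lambda i: (unigram[prefix[i]], i))
    return sorted(order[:k])


def perturb(
    prefix: List[str], positions: List[int], replacement: str = "<unk>"
) -> List[str]:
    """Copy of the prefix with the chosen positions replaced."""
    chosen = set(positions)
    return [replacement if i in chosen else tok for i, tok in enumerate(prefix)]


# Toy corpus, invented for illustration; a real run would use corpus-level counts.
corpus = "the cat sat on the mat the dog sat on the rug zyzzyva".split()
unigram = Counter(corpus)

prefix = "the zyzzyva sat on the mat".split()
positions = rarest_positions(prefix, unigram, k=2)
corrupted = perturb(prefix, positions)
print(positions, corrupted)
```

Feeding `prefix` and `corrupted` to the model and measuring how much of the memorized continuation survives would be the perturbation test itself.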

This deepens the vault's memorization thread mechanistically. It complements the capacity account in "When do language models stop memorizing and start generalizing?" and the fine-tuning-leakage measurement in "Does repeated sensitive data in fine-tuning cause memorization?" by saying where the memorized content lives and how to target it.

Related concepts in this collection 3

When do language models stop memorizing and start generalizing? Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.
capacity account; this adds the spatial localization signature
Does repeated sensitive data in fine-tuning cause memorization? When language models train on the same private or proprietary data multiple times, how much do they end up memorizing and leaking that information at inference time? Understanding this risk is critical for organizations fine-tuning on confidential datasets.
the leakage this localization could target for unlearning
Do hidden massive activations act as attention bias terms? Explores whether a tiny handful of unusually large activations in LLMs function as structural bias terms that shape attention patterns, regardless of input content.
both find specific low-level components doing outsized structural work
