Localizing Paragraph Memorization in Language Models

Paper · arXiv 2403.19851 · Published March 28, 2024

Can we localize the weights and mechanisms used by a language model to memorize and recite entire paragraphs of its training data? In this paper, we show that while memorization is spread across multiple layers and model components, gradients of memorized paragraphs have a distinguishable spatial pattern, being larger in lower model layers than gradients of non-memorized examples. Moreover, the memorized examples can be unlearned by fine-tuning only the high-gradient weights. We localize a low-layer attention head that appears to be especially involved in paragraph memorization. This head predominantly focuses its attention on distinctive, rare tokens that are least frequent in a corpus-level unigram distribution. Next, we study how localized memorization is across the tokens in the prefix by perturbing tokens and measuring the resulting change in the decoding. A few distinctive tokens early in a prefix can often corrupt the entire continuation. Overall, memorized continuations are not only harder to unlearn but also harder to corrupt than non-memorized ones.
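The prefix perturbation experiment described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: `NextToken` is a hypothetical interface standing in for one greedy step of a language model, and `substitute` is an arbitrary replacement token chosen for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption): maps the tokens seen so far
# to the id of the single most likely next token.
NextToken = Callable[[Sequence[int]], int]


def greedy_continuation(next_token: NextToken, prefix: Sequence[int], n_new: int) -> List[int]:
    """Greedily decode n_new tokens, feeding each prediction back as context."""
    context = list(prefix)
    for _ in range(n_new):
        context.append(next_token(context))
    return context[len(prefix):]


def perturbation_impact(
    next_token: NextToken, prefix: Sequence[int], n_new: int, substitute: int
) -> List[float]:
    """For each prefix position, swap in `substitute` and return the
    fraction of continuation tokens that differ from the unperturbed decoding."""
    baseline = greedy_continuation(next_token, prefix, n_new)
    impacts = []
    for pos in range(len(prefix)):
        perturbed = list(prefix)
        perturbed[pos] = substitute  # no effect if the token already equals substitute
        decoded = greedy_continuation(next_token, perturbed, n_new)
        changed = sum(a != b for a, b in zip(baseline, decoded))
        impacts.append(changed / n_new)
    return impacts


def toy_next_token(context: Sequence[int]) -> int:
    """Toy stand-in for a model: the output depends only on the first token."""
    return (context[0] + len(context)) % 1000


impacts = perturbation_impact(toy_next_token, [5, 6, 7, 8], n_new=4, substitute=99)
```

In this toy setup only the first prefix token influences the decoding, so perturbing it changes every continuation token while perturbing any other position changes none. A real model would show a graded pattern across positions.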

Introduction. Some language models are able to emit gigabytes of full-length paragraphs from their training data (Carlini et al., 2020, 2022; McCoy et al., 2023; Haviv et al., 2023; Nasr et al., 2023; New York Times, 2023). These memorized paragraphs must thus be represented somewhere in the model weights (Nasr et al., 2023). We take steps towards localizing the weights and internal mechanisms that are involved in the memorization of paragraphs. Specifically, we study in detail the open-weight model GPT-NEO 125M (Gao et al., 2021), which was trained on the publicly available PILE dataset. As a first step, we identify paragraphs that are memorized by a language model. We use the term “paragraph” for any sequence of 100 tokens. A paragraph is regarded as memorized if, given a prefix of 50 tokens, the model’s greedy decoding of the next 50 tokens exactly matches the true paragraph continuation. We publish the memorized paragraphs alongside our code. We use our dataset of memorized and non-memorized paragraphs to identify differences in how they are processed by the model.
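The memorization criterion above (a 50-token prefix whose greedy 50-token continuation exactly matches the true text) can be sketched as follows. This is an illustration under assumptions, not the authors' released code: `NextToken` is a hypothetical callable standing in for one greedy decoding step of the model, and token ids are plain integers.

```python
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption): maps the tokens seen so far
# to the id of the single most likely next token.
NextToken = Callable[[Sequence[int]], int]


def greedy_continuation(next_token: NextToken, prefix: Sequence[int], n_new: int) -> List[int]:
    """Greedily decode n_new tokens, feeding each prediction back as context."""
    context = list(prefix)
    for _ in range(n_new):
        context.append(next_token(context))
    return context[len(prefix):]


def is_memorized(
    next_token: NextToken,
    paragraph: Sequence[int],
    prefix_len: int = 50,
    cont_len: int = 50,
) -> bool:
    """True if greedy decoding from the prefix reproduces the true continuation exactly."""
    if len(paragraph) != prefix_len + cont_len:
        raise ValueError("paragraph must contain prefix_len + cont_len tokens")
    prefix = paragraph[:prefix_len]
    target = list(paragraph[prefix_len:])
    return greedy_continuation(next_token, prefix, cont_len) == target


def toy_next_token(context: Sequence[int]) -> int:
    """Toy stand-in for a model: always predicts the previous token plus one."""
    return (context[-1] + 1) % 1000


counting_paragraph = list(range(100))   # the toy model reproduces this exactly
corrupted_paragraph = list(range(100))
corrupted_paragraph[75] = 0             # a single mismatch in the continuation
```

Because the criterion is an exact match, a single differing token in the continuation is enough for a paragraph to count as non-memorized.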

Discussion / Conclusion. Our focus lies on identifying “where” memorization-relevant model components may be localized, but our findings open up interesting follow-up questions on the “why” and “how”. In §5.3, we unlearn and edit memorized paragraphs. Gradients flow differently for memorized paragraphs (more in lower layers) than for non-memorized paragraphs (more in higher layers). While many model components are involved, memorization is often localized to a few distinctive tokens in the prefix that are predominantly processed by attention head 2 in layer 1 of GPT-NEO 125M.
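The abstract characterizes the distinctive tokens as those least frequent in a corpus-level unigram distribution. A minimal sketch of how such tokens could be ranked within a prefix is given below. The helper names `unigram_counts` and `rarest_prefix_tokens` are hypothetical, and the corpus is a toy list of token-id sequences rather than the PILE.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def unigram_counts(corpus: Iterable[Sequence[int]]) -> Counter:
    """Count how often each token id occurs across all documents."""
    counts: Counter = Counter()
    for doc in corpus:
        counts.update(doc)
    return counts


def rarest_prefix_tokens(
    prefix: Sequence[int], counts: Counter, k: int = 3
) -> List[Tuple[int, int, int]]:
    """Return up to k (position, token, corpus_count) triples, rarest first.
    Ties are broken by earlier position; unseen tokens have a count of 0."""
    scored = [(pos, tok, counts[tok]) for pos, tok in enumerate(prefix)]
    return sorted(scored, key=lambda item: (item[2], item[0]))[:k]


toy_corpus = [[1, 1, 1, 2], [1, 2, 3]]
counts = unigram_counts(toy_corpus)
rarest = rarest_prefix_tokens([1, 3, 2, 1], counts, k=2)
```

Here token 3 occurs once in the toy corpus and token 2 twice, so positions 1 and 2 of the prefix are ranked as the rarest. Comparing such a ranking against an attention pattern would require access to the model's attention weights, which this sketch does not cover.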
