How much do language models memorize?

Paper · arXiv 2505.24832 · Published May 30, 2025

We propose a new method for estimating how much a model “knows” about a datapoint and use it to measure the capacity of modern language models. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. By eliminating generalization, we can compute the total memorization of a given model, which provides an estimate of model capacity: our measurements estimate that models in the GPT family have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point “grokking” begins, and unintended memorization decreases as models begin to generalize.
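As a rough illustration of how a likelihood-based measurement of this kind can be operationalized, the sketch below estimates per-sample memorization as the gap in Shannon code length between a target model and a reference model standing in for the data-generation process, and converts the bits-per-parameter figure into a total-capacity estimate. The function names, the choice of reference model, and the 124M-parameter example are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (not the paper's code) of a likelihood-based memorization
# estimate: the target model's "savings" in Shannon code length over a
# reference model that stands in for the true data-generation process.
import math

def code_length_bits(token_logprobs):
    """Code length of a sequence in bits, given natural-log token probabilities."""
    return -sum(token_logprobs) / math.log(2)

def unintended_memorization_bits(target_logprobs, reference_logprobs):
    """Bits by which the target model compresses this sample better than the
    reference model; large positive values suggest sample-specific memorization."""
    return code_length_bits(reference_logprobs) - code_length_bits(target_logprobs)

def approx_capacity_bits(num_parameters, bits_per_param=3.6):
    """Total capacity implied by the ~3.6 bits-per-parameter estimate."""
    return bits_per_param * num_parameters

# Example (hypothetical): a 124M-parameter model could store roughly
# 3.6 * 124e6 bits, i.e. on the order of 56 MB, of sample-specific information.
print(f"~{approx_capacity_bits(124_000_000) / 8 / 1e6:.0f} MB")
```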

Studies of language model extraction argue that a data point is memorized if we can induce the model to generate it (Carlini et al., 2023b; Nasr et al., 2023; Schwarzschild et al., 2024). We argue that such generation does not necessarily serve as proof of memorization. Language models can be coerced to output almost any string (Geiping et al., 2024); hence the fact that a model outputs something is not necessarily a sign of memorization. To address this issue, some researchers have suggested constraining the input to the language model, such as by limiting its length (Schwarzschild et al., 2024) or matching it to the prefix preceding the memorized sentence (Carlini et al., 2023b). However, even with these constraints, memorization cannot be conclusively proven, as the model’s ability to generalize may still be at play. For example, a good language model prompted to add two numbers can output the correct answer without having seen the equation before. In fact, recent work (Liu et al., 2025) shows that some instances previously thought to be memorized do not even appear in the training set, and their extractability is a result of generalization. Additionally, verbatim reproduction of a text is not a prerequisite for memorization; a model may still memorize specific patterns or sequences, such as every other token, without generating them verbatim.
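For concreteness, the sketch below implements the prefix-prompted extraction test critiqued above, using greedy decoding with a Hugging Face causal language model; the model name and token budgets are placeholder assumptions, not the setups used in the cited works.

```python
# A minimal sketch of a prefix-prompted extraction test: a sample counts as
# "extracted" if greedy decoding from its prefix reproduces the suffix verbatim.
# The model name and token budgets below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def is_extractable(model, tokenizer, text, prefix_tokens=50, suffix_tokens=50):
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_tokens]
    suffix = ids[prefix_tokens:prefix_tokens + suffix_tokens]
    with torch.no_grad():
        out = model.generate(prefix.unsqueeze(0),
                             max_new_tokens=len(suffix),
                             do_sample=False)  # greedy decoding
    continuation = out[0][len(prefix):]
    return len(continuation) == len(suffix) and bool((continuation == suffix).all())

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Passing this test does not prove memorization (a good model may simply
# generalize), and failing it does not rule memorization out.
```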

Given that extraction is neither necessary nor sufficient, defining memorization accurately is a pressing concern. Existing mathematical definitions of memorization, such as those based on membership inference (Shokri et al., 2017) and differential privacy (Dwork, 2006), are defined at the dataset/distribution level. This makes them inadequate for measuring memorization of specific instances (e.g., a particular textbook). Theoretical notions of ‘influence’ (Feldman, 2020; Feldman & Zhang, 2020) have been proposed to define memorization at the instance level, but they fall short of meeting our needs. These definitions focus on the capacity of a training algorithm to memorize inputs, whereas our interest lies in understanding how much a specific data point is memorized within a given model. This distinction is crucial, as we are concerned with the memorization properties of a single model, rather than the distribution of models generated by a particular training algorithm.
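For reference, Feldman's (2020) label-memorization score makes the algorithm-level nature of these definitions explicit: both probabilities below are taken over the randomness of the training algorithm A, i.e. over a distribution of trained models, rather than over anything intrinsic to one fixed model.

```latex
% Feldman's (2020) label-memorization score for example i of dataset S under
% training algorithm A; probabilities are over the randomness of A.
\operatorname{mem}(A, S, i) =
  \Pr_{h \sim A(S)}\bigl[h(x_i) = y_i\bigr]
  - \Pr_{h \sim A(S \setminus \{i\})}\bigl[h(x_i) = y_i\bigr]
```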