How much does memorization capacity limit a model's ability to learn new information?
This reads the question as: is there a fixed ceiling on what a model can hold, and does hitting it block new learning — or is 'capacity' less of a wall than it sounds?
The corpus suggests the ceiling is real but less of a barrier than you'd guess and, counterintuitively, that filling it is often where the *good* learning starts rather than where it stops. The most striking number comes from work measuring a fixed memorization budget: GPT-family models store roughly 3.6 bits per parameter, and that capacity is a property of the model itself, not of the training algorithm (When do language models stop memorizing and start generalizing?). The surprise is what happens when the budget fills: instead of breaking, the model undergoes a phase transition, grokking, in which it shifts from rote memorization to genuine generalization. In that framing, capacity pressure is the *trigger* for learning to generalize, not the thing that prevents it.
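To make the 3.6 bits-per-parameter figure concrete, here is a minimal arithmetic sketch. The function names are hypothetical, and `dataset_bits` is an assumed stand-in for however the information content of the training data is measured, which the notes here do not specify.

```python
# Approximate memorization capacity reported for GPT-family models.
BITS_PER_PARAM = 3.6


def capacity_bits(num_params, bits_per_param=BITS_PER_PARAM):
    """Estimated memorization budget of a model, in bits."""
    return num_params * bits_per_param


def capacity_megabytes(num_params, bits_per_param=BITS_PER_PARAM):
    """The same budget expressed in decimal megabytes (8 bits per byte)."""
    return capacity_bits(num_params, bits_per_param) / 8 / 1e6


def budget_is_full(num_params, dataset_bits):
    """True when the data carries at least as many bits as the model can memorize.

    `dataset_bits` is a hypothetical input: an estimate of the information
    content of the training set, measured by some method chosen by the reader.
    """
    return dataset_bits >= capacity_bits(num_params)
```

Under this estimate, a 1-billion-parameter model has a budget of about 3.6e9 bits, or roughly 450 MB. In the framing above, `budget_is_full` returning true marks the regime where memorization is saturated and generalization is expected to take over.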
But capacity is only a hard limit if you insist on cramming everything into the same weights. A recurring theme is that what looks like a capacity problem is really an *allocation* problem. Fast-Slow Training (FST) routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting; the authors frame forgetting explicitly as 'a misallocation problem rather than an inherent cost' (Can splitting adaptation into two channels reduce forgetting?). A related finding shows that keeping a model close to its base distribution (low KL drift) preserves its *plasticity*, its raw ability to keep learning new tasks, whereas parameter-only approaches stall when the task domain shifts (Does staying close to the base model preserve learning ability?). So the limit on learning new information isn't just how many bits fit; it's how aggressively you overwrite the bits already there.
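"KL drift" can be read as the KL divergence between the adapted model's next-token distribution and the base model's. The sketch below computes it for toy distributions over a three-token vocabulary. The direction KL(tuned || base) and all the numbers are assumptions for illustration, not values taken from the cited work.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps guards against q_i == 0.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions (hypothetical values).
base = [0.5, 0.3, 0.2]
light_update = [0.45, 0.35, 0.2]   # stays close to the base model
heavy_update = [0.05, 0.05, 0.9]   # rewrites the distribution aggressively

light_drift = kl_divergence(light_update, base)
heavy_drift = kl_divergence(heavy_update, base)
```

Here `light_drift` comes out far smaller than `heavy_drift`, which is the kind of gap the low-drift argument relies on: the lightly updated model has moved much less from its starting point.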
The other escape from the ceiling is to stop using parameters as the only store. Titans pairs short-term attention with a separate neural memory that adaptively records 'surprising' tokens, scaling past 2M-token contexts without the model's fixed weights ever being the bottleneck (Can neural memory modules scale language models beyond attention limits?). Even more pointed: frozen agents, with no weight updates at all, keep improving and transfer to new environments purely by structuring an external textual memory well, gaining 23 points over generic reflection on repeated trials (Can frozen language models continually improve through memory structure alone?). The classic recommender-systems move, Wide & Deep, makes the same bet from the architecture side, jointly training a small memorization component with a generalization component so neither has to carry the other's load (Can one model memorize and generalize better than two?). Learning new information, here, is largely decoupled from parameter capacity.
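As a rough illustration of learning through an external store rather than through weights, the sketch below keeps lessons together with the conditions under which they apply and recalls only those whose conditions hold. The `Lesson` and `ExternalMemory` names and the set-of-strings representation are hypothetical; this is a toy, not the method used in the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    conditions: frozenset  # facts that must hold for the lesson to apply
    action: str            # what was tried
    outcome: str           # what happened


class ExternalMemory:
    """Append-only store of lessons; the model itself is never updated."""

    def __init__(self):
        self._lessons = []

    def write(self, conditions, action, outcome):
        self._lessons.append(Lesson(frozenset(conditions), action, outcome))

    def recall(self, situation):
        """Return lessons whose conditions are all present in the situation."""
        facts = set(situation)
        return [lesson for lesson in self._lessons if lesson.conditions <= facts]
```

Storing the conditions next to each lesson is what lets `recall` skip advice that does not apply in a new environment, instead of returning every past reflection.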
There's also a darker mirror to all this: more memorization isn't free. In chain-of-thought reasoning, local memorization (leaning on the immediately preceding tokens) accounts for up to 67% of reasoning errors, especially as problems get harder and drift from the training distribution (Where do memorization errors arise in chain-of-thought reasoning?). So memorized patterns can actively *crowd out* correct reasoning, which reframes the question: the danger isn't only running out of room, it's over-relying on what's already stored.
The twist worth taking away is that 'can't learn new information' usually isn't a storage failure at all. Models routinely possess knowledge they fail to use: in one documented inference bottleneck, a subtle prompt cue recovers about 15 points of accuracy because the knowledge was there, just not activated (Why do language models fail to use knowledge they possess?). And when internal knowledge genuinely runs short, models can learn *when* to reach outside for it: framing retrieval as a decision improves accuracy by about 22% by switching to external knowledge only when needed (When should language models retrieve external knowledge versus use internal knowledge?). Capacity caps what can live in the weights; it sets almost no ceiling on what a model can learn to *work with*.
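The "retrieve only when needed" idea can be sketched as a gate in front of retrieval. The cited approach learns this decision; the version below swaps in a fixed confidence threshold as a deliberately simplified stand-in, and `toy_internal`, `toy_retrieve`, and the threshold value are all hypothetical.

```python
def answer_with_gate(question, internal, retrieve, threshold=0.7):
    """Answer from internal knowledge when confident, otherwise retrieve.

    `internal(question)` returns (answer, confidence in [0, 1]).
    `retrieve(question)` returns an answer drawn from an external source.
    Returns (answer, used_retrieval).
    """
    guess, confidence = internal(question)
    if confidence >= threshold:
        return guess, False           # parametric knowledge is enough
    return retrieve(question), True   # fall back to external knowledge


# Hypothetical stubs standing in for a model and a retriever.
def toy_internal(question):
    known = {"capital of France": ("Paris", 0.95)}
    return known.get(question, ("unknown", 0.1))


def toy_retrieve(question):
    corpus = {"latest release notes": "found in external document"}
    return corpus.get(question, "no document found")
```

The design point is that retrieval is skipped for questions the model already answers confidently, which avoids injecting irrelevant external text into those steps.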
Sources (9 notes)
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Agents using causal-form memory (preserving applicability conditions) outperform generic reflection by 23 points on repeated trials and gain 4-17 points transferring to new environments, showing memory shape matters more than parameter updates.
Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. An ensemble of separately trained models, by contrast, needs both halves at full size.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Models possess relevant knowledge but fail to activate it without explicit prompting. Adding subtle emphasis recovers 15.3 percentage points of accuracy, and forcing enumeration of preconditions recovers 6-9 points, showing the bottleneck is in constraint inference, not storage.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.