How does memorization capacity saturation trigger the grokking transition?
This explores the claim that generalization (grokking) begins not because the model gets smarter, but because it runs out of room to keep memorizing, and what the corpus says about that tipping point.
This reads the question as asking about a specific mechanism: the idea that a model memorizes training examples until it has no room left, and only then switches to learning rules that generalize. The corpus has two notes that sit almost directly on this claim, and they tell a surprisingly concrete story. The note "When do language models stop memorizing and start generalizing?" puts a number on the ceiling: GPT-family models hold roughly 3.6 bits per parameter, and when that storage fills up, a phase transition kicks in and grokking begins. The striking part is that this capacity is a property of the *model itself*, not of the training recipe. The model has a fixed-size box, and grokking is what happens when the box is full.
What fills the box, and what happens next, is told mechanistically by the note "What happens inside models when they suddenly generalize?". Rather than a single flip, grokking unfolds in three measurable stages: the model first builds lookup-table-style memorization, then slowly grows 'circuits' that actually generalize, and finally *prunes away* the memorized components it no longer needs. From the outside this looks like a sudden jump in test accuracy, but internally it's continuous, and the trigger that starts the whole cascade is exactly the saturation of memorization capacity. So the two notes interlock: one names the ceiling (3.6 bits/parameter), the other shows what the model does once it hits it.
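The 3.6 bits-per-parameter ceiling lends itself to back-of-envelope arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the saturation check assumes that the information content of the training data can be expressed in bits and compared directly against the capacity estimate, which is a simplification of whatever the underlying note actually measures.

```python
# Rough capacity arithmetic based on the ~3.6 bits/parameter figure quoted above.
# All names here are hypothetical helpers, not taken from the source notes.

BITS_PER_PARAM = 3.6  # approximate figure reported for GPT-family models


def capacity_bits(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated total memorization capacity, in bits."""
    return n_params * bits_per_param


def capacity_megabytes(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """The same estimate in megabytes (8 bits per byte, 1e6 bytes per MB)."""
    return capacity_bits(n_params, bits_per_param) / 8 / 1e6


def is_saturated(dataset_bits: float, n_params: int,
                 bits_per_param: float = BITS_PER_PARAM) -> bool:
    """Assumption: the 'box is full' once the data holds at least as many
    bits of information as the model is estimated to be able to store."""
    return dataset_bits >= capacity_bits(n_params, bits_per_param)


# Example: a 100M-parameter model.
n = 100_000_000
print(f"capacity: {capacity_bits(n):.3g} bits, about {capacity_megabytes(n):.0f} MB")
print("saturated by 4e8 bits of data:", is_saturated(4e8, n))
```

On this estimate a 100M-parameter model stores on the order of tens of megabytes of memorized information, which gives a sense of how quickly the box fills relative to typical training-set sizes.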
The interesting turn for a curious reader is that 'memorization' here isn't one monolithic thing. The note "Where do memorization errors arise in chain-of-thought reasoning?" breaks reasoning-time memorization into local, mid-range, and long-range sources, finding that *local* memorization (leaning on the immediately preceding tokens) drives up to 67% of errors, and that this reliance grows precisely as tasks get harder and drift away from the training distribution. That's a useful companion idea: the same memorization tendency that grokking eventually prunes is also what reasoning models fall back on when they're out of their depth.
There's a second lateral thread worth pulling. If grokking is the model reallocating from memorization to general circuits, then several notes here treat 'where you put what you've learned' as the deeper design question. The note "Can splitting adaptation into two channels reduce forgetting?" argues that catastrophic forgetting is a *misallocation* problem, not an inherent cost: route durable lessons into slow weights and disposable ones into fast textual context, and forgetting largely evaporates. Read alongside grokking, both point at the same underlying idea: learning is partly a budgeting problem, and the dramatic-looking transitions happen when the budget gets reallocated. The note "Do language models sparsify their activations under difficult tasks?" adds a related wrinkle: models adaptively *sparsify* their activations under unfamiliar tasks, a kind of self-imposed compression that stabilizes rather than breaks performance.
If you want to chase the thread further, the honest caveat is that the corpus is strongest on the two grokking-specific notes and uses the rest as conceptual neighbors rather than direct evidence. But the payoff for a curious reader is this: the moment a model 'suddenly gets it' may be less a flash of insight than the predictable consequence of a full memory finally forcing a cheaper strategy, and the size of that memory is a measurable, model-specific number.
Sources (5 notes)
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Models trained past overfitting generalize through three stages: memorization via lookup tables, gradual formation of generalizing circuits, then pruning of memorization components. Mechanistic analysis shows this appears discontinuous externally but progresses continuously, triggered by memorization capacity saturation.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.