
What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

Note · 2026-02-23 · sourced from MechInterp

Grokking — the phenomenon where models trained far beyond overfitting suddenly generalize — appears discontinuous from the outside. Mechanistic analysis reveals three continuous phases underneath:

  1. Memorization phase. The model learns to reproduce training data through lookup-table-like mechanisms. Training loss drops, test loss remains high. The memorizing circuit dominates.

  2. Circuit formation phase. A generalizing circuit gradually forms in the weights, competing with the memorizing circuit. For modular addition, this circuit uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. The generalizing circuit is more efficient (uses regularization-favored structure) but initially weaker.

  3. Cleanup phase. The generalizing circuit overtakes the memorizing circuit. Memorization components are pruned away. Test loss drops. Generalization emerges.
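The trigonometric mechanism described in the circuit-formation phase can be made concrete. The sketch below (my own illustration, not code from the cited work) represents each residue as a point on the unit circle and uses the angle-addition identities to turn modular addition into rotation, then decodes by finding the nearest residue:

```python
import numpy as np

p = 113  # modulus, a common choice in modular-addition grokking experiments

def embed(x, k=1):
    # Map a residue (or array of residues) to a point on the unit circle
    # at Fourier frequency k.
    theta = 2 * np.pi * k * np.asarray(x) / p
    return np.cos(theta), np.sin(theta)

def mod_add_via_rotation(a, b, k=1):
    # Angle-addition identities: cos(alpha+beta) and sin(alpha+beta).
    # Composing the two embeddings rotates one point by the other's angle.
    ca, sa = embed(a, k)
    cb, sb = embed(b, k)
    c = ca * cb - sa * sb
    s = sa * cb + ca * sb
    # Decode: the residue whose embedding has maximal cosine similarity
    # with the rotated point is (a + b) mod p.
    cr, sr = embed(np.arange(p), k)
    return int(np.argmax(c * cr + s * sr))

print(mod_add_via_rotation(57, 90))  # (57 + 90) % 113 = 34
```

The generalizing circuit found in grokked modular-addition transformers implements essentially this computation distributed across several frequencies, which is why it compresses better under weight decay than a lookup table.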

Progress measures defined through mechanistic analysis (tracking the formation of specific algorithmic components) allow monitoring grokking as it happens, replacing the seemingly sudden shift with continuous, predictable development.
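One family of such progress measures tracks how concentrated the embedding matrix is in Fourier space: as the generalizing circuit forms, a few frequencies come to dominate. A minimal sketch (the function name and the top-5 concentration statistic are my own simplification, not the paper's exact measures):

```python
import numpy as np

def fourier_sparsity(W_E):
    # W_E: (p, d) embedding matrix over residues 0..p-1.
    # DFT along the residue dimension; each row of F is one frequency.
    F = np.fft.rfft(W_E, axis=0)
    freq_norms = np.linalg.norm(F, axis=1)
    freq_norms = freq_norms / freq_norms.sum()
    # Concentration statistic: mass in the 5 largest frequencies.
    # Near 1.0 for a circuit built from a few frequencies; much lower
    # for an unstructured (memorizing) embedding.
    return float(np.sort(freq_norms)[::-1][:5].sum())
```

Logging a statistic like this during training makes the "sudden" generalization visible as a gradual rise that begins well before test loss moves.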

From the grokked transformers paper, on composition: the gap between composition and comparison tasks correlates with the circuit configuration — comparison allows more systematic generalization because the comparison operation is simpler to represent compactly. The paper recommends cross-layer knowledge-sharing mechanisms (memory augmentation, explicit recurrence) to further unlock transformer generalization.

Formal capacity trigger: The memorization capacity paper (2505.24832) adds a crucial quantitative dimension: GPT-family models have an approximate capacity of 3.6 bits-per-parameter for unintended memorization. Models memorize until this capacity fills, at which point grokking begins and unintended memorization decreases as generalization takes over. This means the three phases are not triggered by training duration per se, but by a measurable capacity saturation event. The paper also formally separates memorization into unintended memorization (information about a specific dataset) and generalization (information about the true data-generation process), and argues that extraction/generation is neither necessary nor sufficient proof of memorization — a model may memorize patterns without reproducing them verbatim.
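The capacity-saturation framing supports a back-of-the-envelope check: compare the model's estimated memorization capacity against the information content of the training set. A sketch under the paper's ~3.6 bits-per-parameter estimate (the function name and interface are my own):

```python
def memorization_headroom(n_params, dataset_bits, bits_per_param=3.6):
    """Estimated memorization capacity minus dataset information content.

    A negative value suggests the dataset exceeds capacity, the regime in
    which (per the capacity-saturation account) grokking-style
    generalization is forced to begin.
    """
    capacity_bits = n_params * bits_per_param
    return capacity_bits - dataset_bits

# Illustrative numbers only: a 1M-parameter model can store roughly
# 3.6M bits, so a 3M-bit dataset leaves ~600k bits of headroom.
print(memorization_headroom(1_000_000, 3_000_000))
```

The interesting prediction is the sign flip: once `dataset_bits` exceeds `capacity_bits`, further memorization must trade off against something, and the generalizing circuit starts to win.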

This connects to How does multi-hop reasoning develop during transformer training? — both describe staged development of reasoning capability, but grokking requires training far beyond the typical schedule. The practical tension: standard training may terminate before the cleanup phase, leaving models in the memorization phase where they appear to have learned but haven't generalized.


Source: MechInterp; enriched from Memory


grokking reveals three continuous phases of learning — memorization then circuit formation then cleanup