
What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

Note · 2026-02-23 · sourced from MechInterp

Grokking — the phenomenon where models trained far beyond overfitting suddenly generalize — appears discontinuous from the outside. Mechanistic analysis reveals three continuous phases underneath:

  1. Memorization phase. The model learns to reproduce training data through lookup-table-like mechanisms. Training loss drops, test loss remains high. The memorizing circuit dominates.

  2. Circuit formation phase. A generalizing circuit gradually forms in the weights, competing with the memorizing circuit. For modular addition, this circuit uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. The generalizing circuit is more efficient (uses regularization-favored structure) but initially weaker.

  3. Cleanup phase. The generalizing circuit overtakes the memorizing circuit. Memorization components are pruned away. Test loss drops. Generalization emerges.
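The trigonometric mechanism described in the circuit-formation phase can be made concrete. The sketch below (my own illustration, not code from the cited work) represents each residue as a point on the unit circle and uses the angle-addition identities to turn modular addition into rotation, then decodes by finding the nearest residue:

```python
import numpy as np

p = 113  # modulus, a common choice in modular-addition grokking experiments

def embed(x, k=1):
    # Map a residue (or array of residues) to a point on the unit circle
    # at Fourier frequency k.
    theta = 2 * np.pi * k * np.asarray(x) / p
    return np.cos(theta), np.sin(theta)

def mod_add_via_rotation(a, b, k=1):
    # Angle-addition identities: cos(alpha+beta) and sin(alpha+beta).
    # Composing the two embeddings rotates one point by the other's angle.
    ca, sa = embed(a, k)
    cb, sb = embed(b, k)
    c = ca * cb - sa * sb
    s = sa * cb + ca * sb
    # Decode: the residue whose embedding has maximal cosine similarity
    # with the rotated point is (a + b) mod p.
    cr, sr = embed(np.arange(p), k)
    return int(np.argmax(c * cr + s * sr))

print(mod_add_via_rotation(57, 90))  # (57 + 90) % 113 = 34
```

The generalizing circuit found in grokked modular-addition transformers implements essentially this computation distributed across several frequencies, which is why it compresses better under weight decay than a lookup table.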

Progress measures defined through mechanistic analysis (tracking the formation of specific algorithmic components) allow monitoring grokking as it happens, replacing the seemingly sudden shift with continuous, predictable development.
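One family of such progress measures tracks how concentrated the embedding matrix is in Fourier space: as the generalizing circuit forms, a few frequencies come to dominate. A minimal sketch (the function name and the top-5 concentration statistic are my own simplification, not the paper's exact measures):

```python
import numpy as np

def fourier_sparsity(W_E):
    # W_E: (p, d) embedding matrix over residues 0..p-1.
    # DFT along the residue dimension; each row of F is one frequency.
    F = np.fft.rfft(W_E, axis=0)
    freq_norms = np.linalg.norm(F, axis=1)
    freq_norms = freq_norms / freq_norms.sum()
    # Concentration statistic: mass in the 5 largest frequencies.
    # Near 1.0 for a circuit built from a few frequencies; much lower
    # for an unstructured (memorizing) embedding.
    return float(np.sort(freq_norms)[::-1][:5].sum())
```

Logging a statistic like this during training makes the "sudden" generalization visible as a gradual rise that begins well before test loss moves.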

From the grokked transformers paper, on composition: the gap between composition and comparison tasks correlates with the circuit configuration — comparison allows more systematic generalization because the comparison operation is simpler to represent compactly. The paper recommends cross-layer knowledge-sharing mechanisms (memory augmentation, explicit recurrence) to further unlock transformer generalization.

Formal capacity trigger: The memorization capacity paper (2505.24832) adds a crucial quantitative dimension: GPT-family models have an approximate capacity of 3.6 bits-per-parameter for unintended memorization. Models memorize until this capacity fills, at which point grokking begins and unintended memorization decreases as generalization takes over. This means the three phases are not triggered by training duration per se, but by a measurable capacity saturation event. The paper also formally separates memorization into unintended memorization (information about a specific dataset) and generalization (information about the true data-generation process), and argues that extraction/generation is neither necessary nor sufficient proof of memorization — a model may memorize patterns without reproducing them verbatim.
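The capacity-saturation framing supports a back-of-the-envelope check: compare the model's estimated memorization capacity against the information content of the training set. A sketch under the paper's ~3.6 bits-per-parameter estimate (the function name and interface are my own):

```python
def memorization_headroom(n_params, dataset_bits, bits_per_param=3.6):
    """Estimated memorization capacity minus dataset information content.

    A negative value suggests the dataset exceeds capacity, the regime in
    which (per the capacity-saturation account) grokking-style
    generalization is forced to begin.
    """
    capacity_bits = n_params * bits_per_param
    return capacity_bits - dataset_bits

# Illustrative numbers only: a 1M-parameter model can store roughly
# 3.6M bits, so a 3M-bit dataset leaves ~600k bits of headroom.
print(memorization_headroom(1_000_000, 3_000_000))
```

The interesting prediction is the sign flip: once `dataset_bits` exceeds `capacity_bits`, further memorization must trade off against something, and the generalizing circuit starts to win.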

This connects to How does multi-hop reasoning develop during transformer training? — both describe staged development of reasoning capability, but grokking requires training far beyond the typical schedule. The practical tension: standard training may terminate before the cleanup phase, leaving models in the memorization phase where they appear to have learned but haven't generalized.


Source: MechInterp; enriched from Memory


grokking reveals three continuous phases of learning — memorization then circuit formation then cleanup