Does grokking in modular arithmetic follow the same three-phase learning trajectory?

This explores whether the classic grokking story from modular arithmetic — train accuracy saturates, then generalization snaps in later — shows up as a clean three-phase arc, and what the corpus says about the shape of that memorize-then-generalize transition.

This reads the question as being about the *shape* of grokking — is it really a tidy three-phase trajectory? — rather than about modular arithmetic specifically, which the corpus doesn't cover head-on. What the collection does have is several independent attempts to count the phases of memorize-then-generalize learning, and they disagree in an interesting way.

The cleanest grokking result here frames it as a *two-state* phase transition, not three. Models memorize until they hit a measurable capacity ceiling, about 3.6 bits per parameter, and only once that storage fills does the shift to genuine generalization kick in (When do language models stop memorizing and start generalizing?). On that account grokking isn't a gradual trajectory at all; it's a threshold you cross when memorization stops being a viable strategy. A related finding from RLVR (reinforcement learning with verifiable rewards) shows the tail end of this directly: a model can keep improving its test accuracy for 1,400 steps *after* training accuracy has already hit 100% (Can a single training example unlock mathematical reasoning?). That is the signature post-saturation gap that makes grokking look like delayed understanding.
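
As a rough illustration of the two quantities in this paragraph, the sketch below turns the roughly 3.6 bits-per-parameter figure into a storage budget and measures a post-saturation gap from a pair of accuracy curves. The function names, the saturation threshold of 1.0, and the toy curves are assumptions made up for illustration; none of them come from the cited notes.

```python
BITS_PER_PARAM = 3.6  # approximate figure quoted in the note; treat as an estimate


def memorization_capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Rough storage budget, in bits, for a model with n_params parameters."""
    return n_params * bits_per_param


def post_saturation_gap(steps, train_acc, test_acc, saturated=1.0):
    """Measure how long test accuracy keeps improving after train accuracy saturates.

    Returns (saturation_step, last_improvement_step, gap_in_steps),
    or None if train accuracy never reaches `saturated`.
    """
    sat_idx = next((i for i, a in enumerate(train_acc) if a >= saturated), None)
    if sat_idx is None:
        return None
    best = test_acc[sat_idx]
    last_idx = sat_idx
    for i in range(sat_idx + 1, len(test_acc)):
        if test_acc[i] > best:  # strictly better than anything seen since saturation
            best = test_acc[i]
            last_idx = i
    return steps[sat_idx], steps[last_idx], steps[last_idx] - steps[sat_idx]


# Toy curves (hypothetical numbers, not experimental results).
steps = [0, 100, 200, 300, 400, 500]
train = [0.2, 0.7, 1.0, 1.0, 1.0, 1.0]
test = [0.1, 0.2, 0.25, 0.3, 0.9, 0.9]

capacity = memorization_capacity_bits(1_000_000)
gap = post_saturation_gap(steps, train, test)
print(f"capacity ~ {capacity:,.0f} bits; saturation/last-improvement/gap = {gap}")
```

A long gap on its own says only that test accuracy kept moving after training accuracy stopped; whether that movement reflects genuine generalization is a separate question.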

Where a genuine three-phase structure does appear is in transformers learning multi-hop reasoning: memorization, then in-distribution generalization, then cross-distribution reasoning, with the jump to true reasoning marked by entity representations clustering together in the model's internal space (How do transformers learn to reason across multiple steps?). That's the closest the corpus comes to validating a three-phase grokking arc, but notice it's three phases because the *task* has a compositional second hop, not because grokking inherently has three stages. The RL literature, meanwhile, counts *two* phases: master execution, then master strategy (Does RL training follow a predictable two-phase learning sequence?). So the number of phases tracks the task's structure, not a universal law of learning.
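
The clustering signal mentioned above can be approximated with a simple statistic: the mean pairwise cosine similarity within a group of entity representations. The sketch below is one assumed way to measure it, using hand-written toy vectors rather than real model activations; the note does not specify the exact metric, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs (needs at least 2 vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "representations" (made up for illustration).
clustered = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]   # point in nearly the same direction
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # point in unrelated directions

clustered_score = mean_pairwise_cosine(clustered)
scattered_score = mean_pairwise_cosine(scattered)
print(f"clustered: {clustered_score:.3f}, scattered: {scattered_score:.3f}")
```

Tracking a score like this across training checkpoints is one way to look for the transition into the third phase, assuming the relevant hidden states can be extracted.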

The more unsettling thread: some of what looks like grokking may not be real generalization. Transformers often pass in-distribution tests by memorizing computation subgraphs and then collapse on novel compositions (Do transformers actually learn systematic compositional reasoning?). Local token-level memorization, meaning prediction from the immediately preceding tokens, accounts for up to 67% of reasoning errors and gets worse exactly when the problem shifts away from the training distribution (Where do memorization errors arise in chain-of-thought reasoning?). Modular arithmetic is the canonical grokking demo precisely because its clean algebraic structure lets you *prove* the model found the general rule. For messier tasks, a confident post-saturation accuracy curve might be the model getting better at subgraph matching rather than grokking the underlying function.
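
To make the memorization-versus-rule distinction concrete, here is a minimal sketch on modular addition. It enumerates every pair (a, b) for a small modulus, holds some pairs out, and compares a lookup-table "memorizer" with the actual rule (a + b) mod p. The modulus 7, the every-third-pair split, and the memorizer's fallback answer of 0 are arbitrary assumptions, and no neural network is trained here.

```python
P = 7  # small modulus chosen only for illustration

# Every input pair for addition mod P.
pairs = [(a, b) for a in range(P) for b in range(P)]

# Deterministic split: hold out every third pair as the test set.
train = [pair for i, pair in enumerate(pairs) if i % 3 != 0]
test = [pair for i, pair in enumerate(pairs) if i % 3 == 0]

# A pure memorizer: stores the answers it saw during training.
table = {(a, b): (a + b) % P for a, b in train}


def memorizer(a, b):
    """Look up a stored answer; fall back to 0 for pairs never seen."""
    return table.get((a, b), 0)


def rule(a, b):
    """The general rule a grokked model would have to implement."""
    return (a + b) % P


def accuracy(model, data):
    """Fraction of pairs in `data` that `model` answers correctly."""
    return sum(model(a, b) == (a + b) % P for a, b in data) / len(data)


mem_train, mem_test = accuracy(memorizer, train), accuracy(memorizer, test)
rule_train, rule_test = accuracy(rule, train), accuracy(rule, test)
print(f"memorizer: train={mem_train:.2f} test={mem_test:.2f}")
print(f"rule:      train={rule_train:.2f} test={rule_test:.2f}")
```

Both models are perfect on the training pairs, so only the held-out pairs separate them. In modular arithmetic the full input space is small enough to enumerate like this, which is what makes the check exhaustive; messier tasks offer no equivalent.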

The thing worth taking away: 'how many phases' is the wrong question. The corpus suggests the real variable is whether the task has a capacity ceiling that forces memorization to fail (two-state transition) or a compositional layer that has to be learned separately on top (extra phase) — and whether the late-stage 'generalization' you observe is the genuine article or memorization wearing a disguise.

Sources (6 notes)

When do language models stop memorizing and start generalizing?

GPT-family models have a measurable memorization capacity of approximately 3.6 bits per parameter. When this capacity fills, a phase transition triggers grokking: the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

How do transformers learn to reason across multiple steps?

Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, which is based on the immediately preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
