Can looped transformers generalize to unseen knowledge combinations?
Do transformers that reuse layers across iterations succeed where standard transformers fail at composing facts in novel ways? This matters because systematic generalization is a hallmark of human reasoning.
Vanilla transformers fail at implicit multi-hop reasoning because knowledge is tied to specific layers. If "the performer of Imagine is John Lennon" and "the spouse of John Lennon is Yoko Ono" are both stored in shallow layers, the first hop is only resolved partway up the stack, and the deeper layers that would need to perform the second hop can no longer reach the shallowly stored fact, since parameters are not shared across layers. The model can memorize trained compositions but cannot systematically generalize to novel combinations.
Recurrent-depth transformers (looped transformers) solve this by applying the same set of layers iteratively. Because parameters are shared across iterations, every iteration can access all stored knowledge regardless of where it was originally encoded. This architectural change enables two capabilities vanilla transformers lack:
Systematic generalization: composing stored facts in combinations never seen during training. The model can answer "the spouse of the performer of Imagine" even if this specific composition was never trained, because each iteration can independently retrieve and chain any stored fact.
Depth extrapolation: generalizing to reasoning depths beyond training (e.g., train on 5-hop, evaluate on 10-hop) simply by running more recurrent iterations at inference time, a form of serial compute scaling in which test-time iterations substitute for additional depth.
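A minimal sketch of the mechanism, assuming a generic PyTorch-style encoder block; the class name LoopedTransformer and the n_iterations argument are illustrative, not taken from any specific paper's codebase. The point is that one shared block replaces a fixed stack, and the iteration count becomes an inference-time knob.

```python
# Minimal looped (recurrent-depth) transformer sketch.
import torch
import torch.nn as nn


class LoopedTransformer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # One shared block: the same parameters are applied at every iteration,
        # so every "depth" can read knowledge stored anywhere in these weights.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor, n_iterations: int) -> torch.Tensor:
        h = self.embed(tokens)
        # Reusing the block n_iterations times replaces fixed depth with
        # adjustable serial compute.
        for _ in range(n_iterations):
            h = self.shared_block(h)
        return self.head(h)


# Usage: train with one iteration budget, run more iterations at test time
# to extend reasoning depth (depth extrapolation).
model = LoopedTransformer(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 16))
logits_train_depth = model(tokens, n_iterations=5)
logits_test_depth = model(tokens, n_iterations=10)
```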
The emergence of systematic generalization follows a sharp three-stage grokking process: (1) memorization of training compositions, (2) in-distribution generalization, then (3) systematic out-of-distribution generalization. The transition to stage 3 is abrupt, not gradual — it represents a qualitative phase change in how the model uses its recurrent structure. This connects to What happens inside models when they suddenly generalize? but adds the novel dimension that recurrence enables a third phase (systematic generalization) that non-recurrent models never reach.
The training-time recurrence strategy matters critically: dynamic recurrence (varying the number of iterations during training) achieves the strongest extrapolation, suggesting the model must learn to use arbitrary iteration counts rather than being trained on a fixed number.
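One plausible way to implement dynamic recurrence, assuming the LoopedTransformer sketch above: sample a fresh iteration count per batch so the model cannot specialize to a single fixed depth. The helper name train_step and the iteration range are illustrative assumptions.

```python
# Sketch of dynamic recurrence during training: randomize the iteration count.
import random
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)


def train_step(tokens, targets, min_iters=1, max_iters=8):
    # Draw the recurrence depth at random for this batch.
    n_iterations = random.randint(min_iters, max_iters)
    logits = model(tokens, n_iterations)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```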
A key limitation: overthinking degrades performance at extreme recursion depths. More iterations initially help, then hurt — the model over-refines beyond what the problem requires. This connects to Does more thinking time always improve reasoning accuracy? and Do iterative refinement methods suffer from overthinking?, extending the overthinking problem from token-level CoT to architectural-level recurrence.
The deeper implication connects to Can recurrent hierarchies achieve reasoning that transformers cannot?: both HRM and looped transformers escape the fixed-depth constraint of standard transformers, but through different mechanisms. HRM uses hierarchical multi-timescale recurrence for latent reasoning without CoT. Looped transformers use flat parameter-sharing recurrence for compositional generalization over stored knowledge. Both point to the same conclusion: the standard transformer's fixed depth is the bottleneck for reasoning, and recurrence is the solution — the question is which form of recurrence for which type of task.
Original note title
recurrent-depth transformers achieve compositional generalization over parametric knowledge through a three-stage grokking process that vanilla transformers cannot