How does dynamic recurrence during training improve depth extrapolation?

This question asks why looping a model's computation during training (recurrence), rather than stacking a fixed number of layers, can let it reason "deeper" at test time than a fixed-layer transformer of the same size. The corpus approaches that idea from several directions rather than giving one tidy answer, and no single note uses the exact phrase. The cleanest anchor is the Hierarchical Reasoning Model note, "Can recurrent hierarchies achieve reasoning that transformers cannot?", which couples a slow planning loop with a fast computation loop across two timescales. Because the computation recurs instead of running once through fixed layers, the model reaches an *effective* depth far beyond what its 27M parameters would suggest, enough to solve Sudoku and mazes on which chain-of-thought methods fail outright. The key claim there is that fixed-depth transformers are pinned under a complexity ceiling (the AC0/TC0 limit), and recurrence is the lever that escapes it. That is the heart of "depth extrapolation": the trained network is small, but the unrolled computation can, in principle, be made as deep as the problem needs.
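The relationship between a fixed parameter count and a variable effective depth can be sketched in a few lines. This is a toy illustration only, not the Hierarchical Reasoning Model's actual architecture: the function names, the tanh update, and the tiny dimensions are all assumptions made for the example. It shows two weight-tied update rules, a fast inner loop and a slow outer loop, being applied a configurable number of times.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.5):
    """Random weight matrix as a list of rows (hypothetical toy initialiser)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def step(weights, state, context):
    """One recurrent update: new_state = tanh(W @ [state; context])."""
    joined = state + context  # list concatenation
    return [math.tanh(sum(w * x for w, x in zip(row, joined))) for row in weights]


def two_timescale_forward(x, w_fast, w_slow, slow_steps, fast_steps):
    """Run a fast loop nested inside a slow loop, reusing the same two matrices.

    Returns the final slow state and the number of sequential update
    applications, which serves here as a stand-in for effective depth.
    """
    slow = [0.0] * len(x)
    fast = list(x)
    applications = 0
    for _ in range(slow_steps):
        for _ in range(fast_steps):
            fast = step(w_fast, fast, slow)  # fast loop, conditioned on the slow state
            applications += 1
        slow = step(w_slow, slow, fast)  # slow loop updates once per fast cycle
        applications += 1
    return slow, applications


rng = random.Random(0)
dim = 4
w_fast = make_matrix(dim, 2 * dim, rng)
w_slow = make_matrix(dim, 2 * dim, rng)
n_params = 2 * dim * (2 * dim)  # the parameter count never changes

x = [0.1, -0.2, 0.3, 0.4]
shallow_out, shallow_depth = two_timescale_forward(x, w_fast, w_slow, 2, 3)
deep_out, deep_depth = two_timescale_forward(x, w_fast, w_slow, 8, 16)
print(n_params, shallow_depth, deep_depth)
```

The same 64 weights are used in both calls; only the loop counts differ, so the sequential depth grows from 8 to 136 update applications without adding parameters. Whether the extra depth actually helps is an empirical matter that this sketch does not address.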

Why depth itself matters, and why getting more of it is worth the trouble, shows up vividly in "Does network depth unlock qualitatively new behaviors in RL?". Scaling self-supervised RL networks toward 1000 layers does not yield gradual gains; it produces sudden jumps at specific thresholds (depth 16 unlocks walking, depth 256 unlocks wall-climbing). Capability appears to be gated behind reachable depth. Recurrence is attractive precisely because it offers a route to that depth that is cheaper in parameters: instead of physically stacking a thousand layers, you reuse a smaller block many times.

The corpus also offers a useful counter-current: depth is not the only axis worth scaling. "Can reasoning systems scale wider instead of only deeper?" argues that sampling parallel latent trajectories (width) sidesteps the serial latency cost of going deeper, recovering the kind of parallelism benefits seen at the token level. And "Can abstractions guide exploration better than depth alone?" shows that depth-only reasoning chains hit an "underthinking" failure mode that breadth-first abstraction avoids. Read against the recurrence papers, these suggest the real win is not depth for its own sake but a controllable compute budget, and recurrence is one way to make depth a dial you can turn at inference rather than a fixed architectural choice.

There's a quieter, more surprising thread too: recurrence doesn't have to serve next-token prediction at all. "Can recurrence consolidate memory without predicting tokens?" describes recurrent passes that run *without input tokens*, transferring recent context into persistent fast weights via learned local rules, much like hippocampal replay during sleep. This reframes "dynamic recurrence during training" as something broader than deeper forward passes: recurrence can be scheduled, allocated, and repurposed for consolidation, which is itself a way of building reusable depth that the model carries forward.
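To make the idea of an input-free consolidation pass concrete, here is a minimal sketch. It is heavily simplified and rests on assumptions: the note describes *learned* local rules, whereas this example substitutes a plain Hebbian outer-product update, and the names `consolidate` and `recall` are hypothetical. The only point it illustrates is that the loop consumes stored context states and writes into fast weights, with no new tokens entering.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[a * b for b in v] for a in u]


def consolidate(fast_weights, context_states, passes, lr=0.1):
    """Replay stored context states into fast weights; no input tokens are read.

    Uses a Hebbian outer-product rule as a stand-in for a learned local rule
    (an assumption of this sketch). Returns a new matrix and leaves the
    argument untouched.
    """
    weights = [row[:] for row in fast_weights]
    for _ in range(passes):
        for state in context_states:
            update = outer(state, state)
            for i, row in enumerate(update):
                for j, value in enumerate(row):
                    weights[i][j] += lr * value
    return weights


def recall(fast_weights, cue):
    """Read from the fast weights with a cue vector (matrix-vector product)."""
    return [sum(w * c for w, c in zip(row, cue)) for row in fast_weights]


dim = 2
empty = [[0.0] * dim for _ in range(dim)]
recent_context = [[1.0, 0.0]]  # one remembered hidden state

consolidated = consolidate(empty, recent_context, passes=3)
response = recall(consolidated, [1.0, 0.0])
print(response)
```

After three replay passes the fast weights respond to the replayed state and stay silent along the orthogonal direction, which is the sense in which the context has been moved out of the input stream and into persistent weights.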

One honest caveat: none of these notes runs the specific experiment of varying recurrence depth at train time and measuring extrapolation to unseen depths at test time, so the literal mechanism in your question is assembled here from adjacent evidence rather than quoted from one source. If you want the single most direct doorway, start with the Hierarchical Reasoning Model note, "Can recurrent hierarchies achieve reasoning that transformers cannot?". It is the note where recurrence and escaping fixed-depth limits meet most explicitly.
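For reference, the experiment that the caveat says is missing could be organised roughly as follows. This is a protocol skeleton under stated assumptions, not a method taken from any of the notes: `train_step` and `evaluate` are hypothetical callables that a real setup would supply, and uniform sampling of depths and the 2x/4x/8x evaluation factors are arbitrary illustrative choices.

```python
import random


def sample_train_depths(num_steps, max_train_depth, seed=0):
    """Draw a recurrence depth per training step, uniformly from 1..max_train_depth."""
    rng = random.Random(seed)
    return [rng.randint(1, max_train_depth) for _ in range(num_steps)]


def extrapolation_depths(max_train_depth, factors=(2, 4, 8)):
    """Depths strictly beyond anything seen in training."""
    return [max_train_depth * f for f in factors]


def run_protocol(train_step, evaluate, num_steps, max_train_depth):
    """Train with a varying loop count, then score at unseen, larger loop counts.

    train_step(depth) and evaluate(depth) are hypothetical hooks; this function
    only fixes the order of operations.
    """
    for depth in sample_train_depths(num_steps, max_train_depth):
        train_step(depth)
    return {d: evaluate(d) for d in extrapolation_depths(max_train_depth)}


# Stub usage: record the depths used in training; no real model or metric.
trained_at = []
results = run_protocol(
    train_step=trained_at.append,
    evaluate=lambda depth: None,  # placeholder, a real run would return a score
    num_steps=100,
    max_train_depth=8,
)
print(sorted(results))
```

A flat or rising score across the held-out depths would count as evidence for depth extrapolation; a collapse beyond the training range would count against it.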

Sources 5 notes

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Does network depth unlock qualitatively new behaviors in RL?

Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can recurrence consolidate memory without predicting tokens?

Language models can use recurrent passes without input tokens to transfer recent context into persistent fast weights via learned local rules, mirroring hippocampal replay during biological sleep. This separates consolidation from prediction, enabling different scheduling and compute allocation.
