INQUIRING LINE

What computational stages does a looped block re-enact across multiple iterations?

This explores what a recurrent (looped) transformer block actually does each time it runs again — whether repeating the same block produces new computation or just re-runs the same inference stages a feedforward model would have done in separate layers.


The surprising answer is that a looped block mostly re-enacts the *same* feedforward stages of inference rather than computing genuinely new operations. Mechanistic analysis shows that each recurrent cycle converges to its own distinct fixed point, with attention behavior stabilizing across iterations; the looped block learns to mirror and repeat the inference stages that a deep feedforward model would have spread across separate layers (see: How do looped transformer layers actually behave during inference?). So the 'depth' achieved by looping is, computationally, the model re-walking a sequence of stages it has folded into shared weights.
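To make "converges to a fixed point" concrete, here is a minimal sketch of a shared-weight block applied repeatedly to its own output. Everything in it is an illustrative assumption rather than the architecture from the cited note: the update rule `tanh(W h + x + b)`, the names `looped_block` and `run_loop`, and the small weight matrix, which is chosen so the map is a contraction and therefore guaranteed to settle.

```python
import math

def looped_block(h, x, W, b):
    """One pass of a toy shared-weight block: h_next = tanh(W h + x + b)."""
    return [
        math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + x_i + b_i)
        for row, x_i, b_i in zip(W, x, b)
    ]

def run_loop(x, W, b, n_iters=30):
    """Apply the same block n_iters times and record how far each pass moves the state."""
    h = [0.0] * len(x)
    deltas = []
    for _ in range(n_iters):
        h_next = looped_block(h, x, W, b)
        # Largest per-coordinate change between consecutive iterations.
        deltas.append(max(abs(a - c) for a, c in zip(h_next, h)))
        h = h_next
    return h, deltas

# Hypothetical weights: every row of W has absolute sum 0.5, so the update is a contraction.
W = [[0.3, -0.2], [0.1, 0.4]]
b = [0.0, 0.1]
x = [0.5, -0.25]

h_final, deltas = run_loop(x, W, b)
print("first step moved the state by", deltas[0])
print("last step moved the state by ", deltas[-1])
```

The shrinking step size is the toy analogue of iterations "stabilizing": once the state stops moving, further passes replay the same computation instead of adding new work.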

What are those stages? A complementary line of work on how transformers acquire multi-step reasoning finds a consistent three-phase signature: memorization, in-distribution generalization, then cross-distribution (compositional) reasoning, with successful reasoning marked by cosine clustering of entity representations (see: How do transformers learn to reason across multiple steps?). The same three-stage arc reappears when shared-parameter recurrent-depth transformers grok compositional generalization: memorize, fit in-distribution, then extrapolate out of distribution (see: Can looped transformers generalize to unseen knowledge combinations?). These are phases of training rather than of a single forward pass, but they indicate what the shared block ends up encoding: a staged progression it can re-enact on each pass. That is consistent with parameter sharing across iterations buying systematic generalization and depth extrapolation that a vanilla fixed-depth transformer can't reach.
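The "cosine clustering" signal can be illustrated with a small measurement: compare the average cosine similarity of representations that share an entity against the average for representations of different entities. This is a sketch under stated assumptions, not the metric used in the cited work; the function `clustering_gap`, the group labels, and the three-dimensional vectors are all made up for illustration.

```python
import math
from itertools import combinations, product

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clustering_gap(groups):
    """Mean within-group cosine minus mean across-group cosine."""
    within = [cosine(u, v) for g in groups.values() for u, v in combinations(g, 2)]
    across = [
        cosine(u, v)
        for g1, g2 in combinations(groups.values(), 2)
        for u, v in product(g1, g2)
    ]
    return sum(within) / len(within) - sum(across) / len(across)

# Hypothetical hidden states, grouped by the entity they are supposed to encode.
reps = {
    "entity_A": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "entity_B": [[0.0, 1.0, 0.1], [0.1, 0.8, 0.0]],
}

gap = clustering_gap(reps)
print("within-minus-across cosine gap:", round(gap, 3))
```

A gap near zero would mean the representations are not organized by entity; a large positive gap is the kind of clustering the paragraph above associates with successful multi-hop reasoning.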

There's a deeper reason looping helps at all: a fixed-depth transformer is bounded by a complexity ceiling (the AC0/TC0 regime), and adding effective depth through recurrence is one way to escape it. The Hierarchical Reasoning Model shows this by coupling slow abstract planning with fast detailed computation across two timescales, solving Sudoku and mazes on which chain-of-thought methods fail (see: Can recurrent hierarchies achieve reasoning that transformers cannot?). Looking inside the iterations, hidden-state reasoning graphs reveal literal *cycles*: distilled reasoning models show ~5 cycles per sample versus near-zero in base models, and those cycles line up with documented 'aha moments' where the model reconsiders an intermediate answer (see: Do reasoning cycles in hidden states reveal aha moments?). The re-enacted stage, in other words, isn't always passive repetition; some loops are the model revisiting and revising.
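As a rough picture of what counting cycles in a reasoning graph might look like, the sketch below treats a reasoning trace as a walk over discrete nodes and counts each return to a node the walk has already left. The node ids, the function `count_cycles`, and this particular definition of a cycle are assumptions made for illustration; the cited note does not specify how the graph is built or how cycles are counted.

```python
def count_cycles(path):
    """Count how many times a walk returns to a node it has already left."""
    seen = set()
    cycles = 0
    prev = None
    for node in path:
        if node == prev:
            continue  # staying on the same node is not counted as a cycle
        if node in seen:
            cycles += 1  # the walk came back to an earlier node
        seen.add(node)
        prev = node
    return cycles

# Hypothetical traces: each integer stands for a cluster of similar hidden states.
linear_trace = [0, 1, 2, 3, 4]
revising_trace = [0, 1, 2, 1, 3, 0, 4]

print(count_cycles(linear_trace))    # 0
print(count_cycles(revising_trace))  # 2
```

In this toy framing, a trace that never revisits a node corresponds to the "near-zero cycles" case, while each return to an earlier node is a candidate moment of reconsideration.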

The lateral payoff is that not all re-enactment is equally useful, and some of it can be pruned. Test-time analysis of attention maps sorts reasoning steps into categories and finds that verification and backtracking steps receive minimal downstream attention, which lets you cut ~75% of steps without losing accuracy (see: Can reasoning steps be dynamically pruned without losing accuracy?). And rather than only looping deeper, you can loop *wider*: sampling parallel latent trajectories sidesteps the serial latency of depth-only scaling while matching its benefits (see: Can reasoning systems scale wider instead of only deeper?). The takeaway: a looped block's iterations are largely a replay of the same staged inference pipeline, which is exactly why much of it is compressible and parallelizable rather than irreducible new computation.
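The pruning idea can be sketched as a simple selection rule: score each reasoning step by how much attention later steps pay to it, then keep only the highest-scoring steps. This is a minimal illustration under assumptions, not the cited framework's procedure; the function `prune_steps`, the `keep_fraction` parameter, the choice to always keep the final step, and the step-level attention matrix are all hypothetical.

```python
def prune_steps(attn, keep_fraction=0.25):
    """Keep the steps that receive the most attention from later steps.

    attn[i][j] is the attention step i pays to earlier step j (j < i).
    The final step is always kept, since nothing comes after it to attend to it.
    """
    n = len(attn)
    # Total attention each step receives from all later steps.
    received = [sum(attn[i][j] for i in range(j + 1, n)) for j in range(n)]
    n_keep = max(1, round(n * keep_fraction))
    # Rank every step except the last by received attention, highest first.
    ranked = sorted(range(n - 1), key=lambda j: received[j], reverse=True)
    kept = set(ranked[: n_keep - 1]) | {n - 1}
    return sorted(kept), received

# Hypothetical step-level attention for a four-step trace (lower triangular).
attn = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.1, 0.3, 0.0],
]

kept, received = prune_steps(attn, keep_fraction=0.5)
print("kept steps:", kept)
```

With these made-up numbers, step 0 is attended to heavily and survives, while steps 1 and 2 receive little downstream attention and are dropped, mirroring the claim that rarely consulted steps can be removed.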


Sources (7 notes)

How do looped transformer layers actually behave during inference?

Mechanistic analysis reveals looped models converge each recurrent cycle to distinct fixed points, with attention behavior stabilizing across iterations. Recurrent blocks learn to mirror and repeat the same inference stages as feedforward models rather than compute genuinely new operations.

How do transformers learn to reason across multiple steps?

Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.

Can looped transformers generalize to unseen knowledge combinations?

Recurrent-depth transformers with shared parameters across iterations enable systematic generalization and depth extrapolation that vanilla transformers cannot achieve. This emerges through a sharp three-phase process: memorization, in-distribution, then out-of-distribution generalization.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Do reasoning cycles in hidden states reveal aha moments?

Distilled reasoning models show ~5 cycles per sample versus near-zero in base models, and cyclicity correlates with accuracy. These cycles in hidden-state reasoning graphs directly map to RL-trained models' documented aha moments—moments when models reconsider intermediate answers.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
