Why does looping computation outperform adding more transformer layers?

This explores why reusing the same transformer layers in a loop often beats simply stacking more new layers — and what the corpus says about where that advantage comes from and where it stops.

The clearest head-to-head evidence comes from masked diffusion models, where looping a handful of early-middle layers matched same-size models with 3.3× fewer FLOPs and beat deeper non-looped baselines on reasoning tasks (Does looping layers beat adding depth in diffusion models?). The takeaway isn't that depth is useless; it's that *reused* computation buys more than *added* computation under a fixed parameter budget.

The mechanistic reason is almost anticlimactic: a looped transformer doesn't invent new operations on each pass. Each recurrent cycle converges toward its own fixed point and re-enacts the same inference stages a feedforward stack would have spread across separate layers (How do looped transformer layers actually behave during inference?). So if a deep model is essentially repeating a similar transformation layer after layer, you can collapse all of that into one block applied many times and pay for the weights only once. On memory-bound hardware this flips an intuition entirely: recomputing a shared block twice is cheaper than fetching two separate sets of weights, because moving weights, not computing with them, is the bottleneck (Does recomputing weights cost less than moving them on mobile?).
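The "one block applied many times" idea can be made concrete with a toy sketch. This is not code from any of the cited notes: `make_block`, `apply_block`, `run_looped`, and `weight_traffic_bytes` are hypothetical names, the "layer" is a small tanh map instead of a real transformer block, and the weight-traffic estimate assumes a reused block stays cached between passes. Under those assumptions it shows a shared block settling toward a fixed point while costing one block's worth of parameters and weight fetches.

```python
import math
import random


def make_block(dim, scale, rng):
    """Build one toy layer (W, b). Each row of W is rescaled so its absolute
    values sum to `scale` (an illustrative choice, not a trained weight)."""
    W = []
    for _ in range(dim):
        row = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = sum(abs(v) for v in row)
        W.append([scale * v / norm for v in row])
    b = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    return W, b


def apply_block(block, x):
    """One pass: x -> tanh(W x + b). tanh is 1-Lipschitz, so with row sums
    below 1 the map is a contraction in the max norm and has a fixed point."""
    W, b = block
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def param_count(block):
    W, b = block
    return len(W) * len(W[0]) + len(b)


def run_looped(block, x, n_loops):
    """Apply the same block n_loops times, recording how far each pass moves x."""
    deltas = []
    for _ in range(n_loops):
        nxt = apply_block(block, x)
        deltas.append(max(abs(a - c) for a, c in zip(nxt, x)))
        x = nxt
    return x, deltas


def weight_traffic_bytes(n_distinct_blocks, params_per_block, bytes_per_param=2):
    """Weight bytes fetched from memory. ASSUMPTION: each distinct block is
    loaded once and a reused block stays cached between passes."""
    return n_distinct_blocks * params_per_block * bytes_per_param


rng = random.Random(0)
dim, depth = 8, 12

shared = make_block(dim, scale=0.5, rng=rng)                         # looped: one block
stack = [make_block(dim, scale=0.5, rng=rng) for _ in range(depth)]  # stacked: 12 blocks

x0 = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
x_final, deltas = run_looped(shared, x0, depth)

looped_params = param_count(shared)
stacked_params = sum(param_count(b) for b in stack)
looped_bytes = weight_traffic_bytes(1, looped_params)
stacked_bytes = weight_traffic_bytes(depth, param_count(stack[0]))

print("per-pass movement:", [round(d, 6) for d in deltas])
print("params  looped vs stacked:", looped_params, stacked_params)
print("weight bytes looped vs stacked:", looped_bytes, stacked_bytes)
```

Both configurations perform the same number of passes, so compute is equal in this toy; the difference is entirely in how many parameters exist and how many weight bytes have to be moved. Real models will not contract this cleanly, which is why the sketch forces it through the `scale` argument.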

But the more interesting payoff is in *generalization*, not just efficiency. Sharing parameters across iterations lets recurrent-depth transformers extrapolate to more reasoning steps than they were trained on. This is systematic compositional generalization that vanilla transformers can't reach, and it emerges through a sharp three-phase grokking process (Can looped transformers generalize to unseen knowledge combinations?). A fixed stack of distinct layers tends to memorize computation subgraphs and fails when asked to compose them in new ways (Do transformers actually learn systematic compositional reasoning?); a loop, by contrast, learns *one* operation it can apply a variable number of times, which is exactly what "do this again" reasoning needs.
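A deliberately simplified caricature of the structural difference, assuming a multi-hop lookup task: nothing here is trained, and `hop`, `looped_model`, and `fixed_stack_model` are hypothetical names introduced only for illustration. The looped version reuses one operation for as many steps as the query needs, while the fixed stack runs out of layers.

```python
def hop(successor, node):
    """The single shared operation: follow one edge of the graph."""
    return successor[node]


def looped_model(successor, start, n_steps):
    """Apply the one shared operation as many times as the query asks for."""
    node = start
    for _ in range(n_steps):
        node = hop(successor, node)
    return node


def fixed_stack_model(successor, start, n_steps, depth=3):
    """A stack with `depth` hop layers can perform at most `depth` hops,
    so longer queries stop early and return the wrong node."""
    node = start
    for _ in range(min(n_steps, depth)):
        node = hop(successor, node)
    return node


# A 10-node ring: node i points to node (i + 1) % 10.
successor = {i: (i + 1) % 10 for i in range(10)}

for n_steps in (2, 3, 7):
    print(n_steps,
          looped_model(successor, 0, n_steps),
          fixed_stack_model(successor, 0, n_steps))
```

The two agree up to the stack's depth and diverge beyond it. The notes cited above describe a subtler failure (memorized subgraphs that don't recompose), so this should be read as an intuition pump for "variable number of applications", not as a model of that result.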

There's also a hard theoretical ceiling that depth alone struggles to escape. Fixed-depth transformers are confined to the AC0/TC0 complexity classes, which cap the kinds of problems they can solve no matter how many layers you stack. A model that loops, coupling slow abstract planning with fast detailed computation across two timescales, breaks past that ceiling, solving Sudoku and mazes where chain-of-thought fails outright, with only 27M parameters (Can recurrent hierarchies achieve reasoning that transformers cannot?). Iterative self-application is also what lets standard transformers jump from 10-digit to 100-digit addition by retraining on their own filtered-correct outputs, improving exponentially rather than linearly (Can transformers improve exponentially by learning from their own correct solutions?).

The thing you didn't know you wanted to know: looping wins precisely *because* it doesn't do anything new each pass. Depth scaling spends parameters teaching each layer a fresh trick; looping bets that hard problems are the same trick repeated, and that bet pays off in FLOPs, in weight-movement cost, in compositional reach, and in escaping a complexity class that more layers can't.

Sources 7 notes

Does looping layers beat adding depth in diffusion models?

LoopMDM shows that looping early-middle layers is more efficient than adding depth: it matches same-size models with 3.3× fewer FLOPs and beats deeper non-looped baselines on reasoning tasks. Reused computation proves more effective than added depth under fixed parameter budgets.

How do looped transformer layers actually behave during inference?

Mechanistic analysis reveals looped models converge each recurrent cycle to distinct fixed points, with attention behavior stabilizing across iterations. Recurrent blocks learn to mirror and repeat the same inference stages as feedforward models rather than compute genuinely new operations.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice incurs less latency than fetching separate weights, gaining accuracy with no parameter increase.

Can looped transformers generalize to unseen knowledge combinations?

Recurrent-depth transformers with shared parameters across iterations enable systematic generalization and depth extrapolation that vanilla transformers cannot achieve. This emerges through a sharp three-phase process: memorization, then in-distribution generalization, then out-of-distribution generalization.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can transformers improve exponentially by learning from their own correct solutions?

Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.
