Why does reapplying the same transformer block work better than computing new layers?

This explores why looping or reusing one transformer block — instead of stacking fresh, separately-trained layers — can match or beat the usual deep stack, and what that reveals about what 'depth' is actually doing.

The corpus offers two very different reasons for this, and they happen to point the same way: one about hardware, one about what depth is really for.

The blunt practical reason comes from mobile. On memory-bound devices the bottleneck isn't math, it's moving weights into the compute units. MobileLLM's finding is almost counterintuitive: sharing weights between adjacent blocks, so that one block is computed twice, costs less latency than fetching a second block's separate weights, and it gains accuracy with zero extra parameters (see: Does recomputing weights cost less than moving them on mobile?). So part of the answer is that 'new layers' were never free: they cost memory traffic that reuse sidesteps.
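
The trade-off can be made concrete with a back-of-envelope latency model. This is only a sketch: every constant below is a hypothetical placeholder, not a MobileLLM measurement, and it assumes the shared block's weights stay in fast memory after the first pass, so the second pass pays for compute only.

```python
# Rough latency model for one decoding step at batch size 1.
# All constants are hypothetical placeholders chosen for illustration.

def fetch_seconds(params, bytes_per_param, bandwidth_bytes_per_s):
    """Time to stream one block's weights in from slow memory."""
    return params * bytes_per_param / bandwidth_bytes_per_s


def compute_seconds(params, flops_per_s):
    """Time to apply one block to one token (about 2 FLOPs per weight)."""
    return 2 * params / flops_per_s


def two_distinct_blocks(params, bytes_per_param, bandwidth, flops):
    # Each block owns its weights, so both sets must be fetched.
    one = fetch_seconds(params, bytes_per_param, bandwidth) + compute_seconds(params, flops)
    return 2 * one


def one_block_twice(params, bytes_per_param, bandwidth, flops):
    # Assumption: weights remain in fast memory after the first pass,
    # so the second pass costs compute only.
    fetch = fetch_seconds(params, bytes_per_param, bandwidth)
    return fetch + 2 * compute_seconds(params, flops)


PARAMS = 10_000_000   # weights per block (hypothetical)
BYTES_PER_PARAM = 1   # e.g. 8-bit weights (hypothetical)
BANDWIDTH = 10e9      # bytes per second from slow memory (hypothetical)
FLOPS = 1e12          # sustained FLOPs per second (hypothetical)

distinct = two_distinct_blocks(PARAMS, BYTES_PER_PARAM, BANDWIDTH, FLOPS)
shared = one_block_twice(PARAMS, BYTES_PER_PARAM, BANDWIDTH, FLOPS)

print(f"two distinct blocks:  {distinct * 1e3:.3f} ms")
print(f"one block, run twice: {shared * 1e3:.3f} ms")
```

In this model the saving is exactly one weight fetch, so the gap grows as the device becomes more memory-bound.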

The deeper reason is that a lot of what extra layers compute isn't actually new. Mechanistic analysis of looped models shows that each recurrent cycle converges to a fixed point and the attention pattern stabilizes across iterations. The recurrent block learns to *re-enact the same inference stage* a feedforward stack would have spread across distinct layers, rather than inventing fresh operations at each depth (see: How do looped transformer layers actually behave during inference?). If consecutive layers in a normal transformer are largely repeating a similar refinement step, then tying their weights loses little and forces the model to learn one clean, reusable operation instead of many noisy near-duplicates.
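
What 'converging to a fixed point' looks like can be sketched with a toy stand-in for a looped block. The map below is not a transformer: it is a hypothetical two-dimensional update, h = tanh(W h + x), with weights chosen small enough that the map is a contraction, so repeated application is guaranteed to settle.

```python
import math

# Hypothetical toy weights; each row's absolute values sum to 0.5,
# which makes the update a contraction in the max norm.
W = [[0.3, -0.2],
     [0.1, 0.4]]
X = [0.5, -0.3]  # fixed input, re-injected on every pass


def block(h):
    """One pass of the shared block: h -> tanh(W h + x)."""
    return [
        math.tanh(sum(w * hj for w, hj in zip(row, h)) + x)
        for row, x in zip(W, X)
    ]


def run_loop(steps):
    """Apply the same block repeatedly, recording how far the state moves."""
    h = [0.0, 0.0]
    deltas = []
    for _ in range(steps):
        new_h = block(h)
        deltas.append(max(abs(a - b) for a, b in zip(new_h, h)))
        h = new_h
    return h, deltas


state, deltas = run_loop(20)
for i, d in enumerate(deltas, start=1):
    print(f"pass {i:2d}: state moved by {d:.2e}")
```

The per-pass movement shrinks toward zero: later passes stop changing the state, which is the toy analogue of a recurrent cycle settling into a stable inference stage.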

That constraint turns out to be a feature, not just a saving. Recurrent-depth transformers with shared parameters achieve a kind of compositional and depth generalization that vanilla stacks can't: they can extrapolate to more reasoning steps than they were trained on. This ability emerges through a sharp three-phase grokking transition, from memorization to in-distribution generalization to out-of-distribution generalization (see: Can looped transformers generalize to unseen knowledge combinations?). Weight sharing is a strong inductive bias: it says 'whatever you do, do the same thing each step,' which is exactly the prior you want for problems that are genuinely iterative. Contrast this with the failure mode of ordinary transformers, which often fake compositional reasoning by memorizing computation subgraphs and then break on novel combinations (see: Do transformers actually learn systematic compositional reasoning?). Reuse pushes against that shortcut by making a single operation carry the load.
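
The 'same thing each step' prior can be illustrated without any learning at all. In the hypothetical toy below, the reusable operation is a single pointer hop over a lookup table, and the names `hop`, `looped`, `EDGES`, and `TRAIN_DEPTH` are made up for the illustration. It only shows why one shared step can be run for more steps than were ever seen, not how a trained model arrives at such a step.

```python
# A toy "knowledge graph": each node points to exactly one successor.
EDGES = {"a": "b", "b": "c", "c": "d", "d": "e", "e": "f", "f": "a"}


def hop(node, edges):
    """The single reusable operation: follow one edge."""
    return edges[node]


def looped(node, edges, steps):
    """Apply the same operation `steps` times, like a weight-tied loop."""
    for _ in range(steps):
        node = hop(node, edges)
    return node


TRAIN_DEPTH = 3  # hypothetical depth seen during "training"

# Within the trained depth.
print(looped("a", EDGES, TRAIN_DEPTH))
# Beyond it: the loop just keeps applying the same step.
print(looped("a", EDGES, TRAIN_DEPTH + 2))
```

A stack of three separately parameterized layers has no fourth layer to apply, whereas the loop's depth is just a counter.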

The thing you didn't know you wanted to know: this reframes depth itself. If knowledge in a transformer is less a stack of stored archives and more a *flow* of activations being progressively transformed through the residual stream (see: Do transformer models store knowledge or generate it continuously?), then layers are steps in a process, not shelves of facts, and a process is something you can run as a loop. Pushed to the limit, there provably exists a single finite-size transformer that can compute any computable function given the right prompt (see: Can a single transformer become universally programmable through prompts?), although standard training rarely produces models that work this way. That is the theoretical ceiling of the same idea: you don't need more distinct layers, you need the right operation applied the right number of times.

Sources (6 notes)

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks, so that one block is computed twice, incurs less latency than fetching separate weights and gains accuracy with no parameter increase.

How do looped transformer layers actually behave during inference?

Mechanistic analysis reveals looped models converge each recurrent cycle to distinct fixed points, with attention behavior stabilizing across iterations. Recurrent blocks learn to mirror and repeat the same inference stages as feedforward models rather than compute genuinely new operations.

Can looped transformers generalize to unseen knowledge combinations?

Recurrent-depth transformers with shared parameters across iterations enable systematic generalization and depth extrapolation that vanilla transformers cannot achieve. This emerges through a sharp three-phase process: memorization, then in-distribution generalization, then out-of-distribution generalization.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.
