Why do standard transformers fail to encode recursive structure in their hidden states?

This explores why fixed-depth transformers struggle to represent nested, self-referential structure (recursion) — and what the corpus suggests is missing architecturally versus what's a training artifact.

This reads the question as asking about a limit baked into the transformer's shape, not just a quirk of training: why don't hidden states naturally hold the kind of nested, stack-like structure that recursion (balanced brackets, grammar trees, multi-step plans) demands? The corpus points at two answers that pull in different directions.

The first is a hard ceiling. A standard transformer runs a fixed number of layers regardless of how deep the problem nests, which caps the 'computational depth' it can express — formally the AC0/TC0 complexity classes. The Hierarchical Reasoning Model makes this concrete: by coupling slow planning with fast computation across two timescales, a 27M-parameter recurrent model solves Sudoku and mazes where chain-of-thought transformers fail outright Can recurrent hierarchies achieve reasoning that transformers cannot?. Looping the same parameters repeatedly buys back some of the missing depth and unlocks compositional generalization vanilla transformers can't reach Can looped transformers generalize to unseen knowledge combinations?. So part of the failure is simply that recursion wants variable depth and the architecture supplies a constant.

The second answer is subtler and more damning: even within their depth budget, transformers tend not to *learn* recursive rules — they memorize the shapes that recursion would have produced. One study shows compositional reasoning in transformers collapses into 'linearized subgraph matching': the model stitches together computation paths seen in training and breaks badly on genuinely novel compositions, with errors compounding step by step Do transformers actually learn systematic compositional reasoning?. That's the opposite of holding a recursive structure in state — it's pattern-completion wearing recursion's clothes. The fix that works best for syntax is to *give* the model the missing machinery: a pushdown layer with an explicit stack tape yields 3–5x more sample-efficient syntactic generalization, which tells you the recursive bookkeeping was the bottleneck, not raw capacity Can explicit stack tracking improve how transformers learn recursive syntax?.

Here's the twist worth sitting with: the failure may be in the *learned weights*, not the architecture's reach. A single finite transformer provably exists that can compute any computable function given the right prompt — Turing-complete in principle — yet ordinary training almost never produces a model that implements arbitrary programs that way Can a single transformer become universally programmable through prompts?. The capacity is there; gradient descent just doesn't steer toward it. And when models do reach a clean internal organization, it's often through a sharp grokking transition rather than smooth learning Can looped transformers generalize to unseen knowledge combinations? — and two models with identical accuracy can have wildly different internal structure, one robust and one fractured, with the difference invisible to standard metrics Can models be smart without organized internal structure?.

What you didn't know you wanted to know: there's a counter-current suggesting transformers aren't as structureless as the pessimistic readings imply. Pruning experiments find networks spontaneously carve compositional tasks into isolated modular subnetworks, each cleanly ablatable Do neural networks naturally learn modular compositional structure?, and transformers can bootstrap themselves from 10-digit to 100-digit addition by retraining on their own filtered-correct outputs Can transformers improve exponentially by learning from their own correct solutions?. So the honest synthesis is: standard transformers don't fail to encode recursion because they *can't* — they fail because fixed depth caps it, training rarely finds the recursive solution over the memorized one, and the metrics we use can't tell the two apart. Add explicit stack/recurrence machinery and much of the failure evaporates.

Sources 8 notes

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can looped transformers generalize to unseen knowledge combinations?

Recurrent-depth transformers with shared parameters across iterations enable systematic generalization and depth extrapolation that vanilla transformers cannot achieve. This emerges through a sharp three-phase process: memorization, in-distribution, then out-of-distribution generalization.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Can explicit stack tracking improve how transformers learn recursive syntax?

Pushdown Layers—a drop-in self-attention replacement with explicit stack tracking—achieve 3-5x more sample-efficient syntactic generalization while maintaining perplexity. The improvement shows that recursive structure specifically benefits from architectural inductive bias despite general compositional generalization emerging from scale.

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can transformers improve exponentially by learning from their own correct solutions?

Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.

Why do standard transformers fail to encode recursive structure in their hidden states?

Sources 8 notes

Next inquiring lines