Can latent recurrence overcome the trainability costs of depth?

This explores whether looping computation through a shared latent space, rather than physically stacking more layers, can buy you the benefits of depth (composing abstractions) without paying depth's training penalties (vanishing gradients, serial latency, parameter bloat).

'Latent recurrence' here means reusing the same computation block over and over in a latent space; the question is whether that can give a model the gains of depth without the costs that make deep stacks hard to train. The corpus doesn't answer this head-on with a single paper, but lay three findings side by side and a real answer emerges. First, depth genuinely matters: at the 125M–350M scale, MobileLLM's deep-and-thin architectures gain 2.7–4.3% accuracy over balanced designs by composing abstract concepts layer by layer rather than spreading parameters sideways (Does depth matter more than width for tiny language models?). So whatever you do, you want depth's compositional behavior. The question is how to get it cheaply.
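To make the basic mechanism concrete, here is a minimal sketch of "depth by stacking" next to "depth by recurrence". It is a toy illustration, not code from any of the cited notes: the dense-plus-ReLU block, the width of 8, the depth of 6, and every function name are assumptions chosen for readability.

```python
import random


def make_block(width, rng):
    """Build one dense block: a width x width weight matrix and a bias vector."""
    scale = 1.0 / width ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
    bias = [0.0] * width
    return weights, bias


def apply_block(block, h):
    """Apply the block to latent vector h: affine map followed by ReLU."""
    weights, bias = block
    return [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(weights, bias)]


def param_count(blocks):
    """Count scalar parameters across a list of blocks."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in blocks)


def run_stacked(blocks, h):
    """Depth by stacking: each step uses its own parameters."""
    for block in blocks:
        h = apply_block(block, h)
    return h


def run_recurrent(block, h, steps):
    """Depth by recurrence: every step reuses the same parameters."""
    for _ in range(steps):
        h = apply_block(block, h)
    return h


rng = random.Random(0)
width, depth = 8, 6                      # hypothetical sizes
stacked = [make_block(width, rng) for _ in range(depth)]
shared = make_block(width, rng)

h0 = [1.0] * width
out_stacked = run_stacked(stacked, h0)
out_recurrent = run_recurrent(shared, h0, steps=depth)

stacked_params = param_count(stacked)    # depth * (width * width + width)
shared_params = param_count([shared])    # width * width + width
```

Both paths perform the same number of sequential block applications, so the recurrent version saves parameters (a factor of `depth` in this toy setup) but not serial steps.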

Two approaches in the collection attack the cost directly. Latent-Thought Language Models add a scaling dimension that's independent of parameter count: they couple fast 'local' learning of latent thought vectors with slow 'global' learning of the decoder, so reasoning can scale by growing the latent space and not only the layer stack (Can latent thought vectors scale language models beyond parameters?). That's the heart of the 'latent recurrence' bet: spend more computation without making the network physically deeper. The catch the corpus surfaces is latency. Depth is inherently serial, since each layer waits on the one before it, and a recurrent loop waits on its previous iteration in the same way.
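The fast-local / slow-global coupling can be illustrated with a deliberately tiny sketch. This is not the Latent-Thought Language Model training procedure: the source describes variational learning, whereas the code below swaps in plain gradient descent on point estimates, with a scalar "decoder" `w`, one scalar latent per example, and learning rates that are all assumptions made for illustration.

```python
def loss(w, zs, ys, lam):
    """Mean squared reconstruction error plus a quadratic penalty on each latent."""
    return sum((w * z - y) ** 2 + lam * z ** 2 for z, y in zip(zs, ys)) / len(ys)


def fit_dual_rate(ys, rounds=20, inner_steps=5, fast_lr=0.3, slow_lr=0.01, lam=0.1):
    """Alternate fast per-example latent updates with slow shared-decoder updates."""
    w = 1.0                      # global decoder parameter, shared by all examples
    zs = [0.0] * len(ys)         # one local latent per example
    history = [loss(w, zs, ys, lam)]
    for _ in range(rounds):
        # Fast local phase: several large steps on each latent, decoder frozen.
        for _ in range(inner_steps):
            zs = [z - fast_lr * (2 * w * (w * z - y) + 2 * lam * z)
                  for z, y in zip(zs, ys)]
        # Slow global phase: one small step on the decoder, latents frozen.
        grad_w = sum(2 * z * (w * z - y) for z, y in zip(zs, ys)) / len(ys)
        w -= slow_lr * grad_w
        history.append(loss(w, zs, ys, lam))
    return w, zs, history


ys = [2.0, -1.0, 0.5]            # hypothetical targets
w, zs, history = fit_dual_rate(ys)
```

In this toy, most of the loss reduction comes from the per-example latents in the fast phase, while the shared parameter drifts slowly. That mirrors the dual-rate idea in spirit only.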

That's exactly the cost GRAM tries to dodge by scaling in width instead. It samples parallel latent trajectories (stochastic transitions that explore the solution space along independent paths), so you get more computation without the serial bottleneck of depth-only scaling and, crucially, without inflating variance (Can reasoning systems scale wider instead of only deeper?). Read together, 'Can latent thought vectors scale language models beyond parameters?' and 'Can reasoning systems scale wider instead of only deeper?' frame the real trade-off: latent recurrence can substitute for depth, but pure recurrence reintroduces depth's serial latency, so the live design move is to spread some of that recurrence across parallel latent paths.
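The depth-versus-width accounting behind that trade-off can be sketched in a few lines. This is not GRAM's algorithm: the transition function (a contraction plus Gaussian noise), the seeds, and all names below are hypothetical, and the trajectories are run in a plain loop for simplicity even though nothing in them depends on one another.

```python
import random


def stochastic_step(h, rng, noise=0.1):
    """Hypothetical stochastic latent transition: shrink the state and add noise."""
    return [0.9 * x + rng.gauss(0.0, noise) for x in h]


def sample_trajectory(h0, steps, seed):
    """Run one latent trajectory; its steps are inherently sequential."""
    rng = random.Random(seed)
    h = list(h0)
    for _ in range(steps):
        h = stochastic_step(h, rng)
    return h


def sample_parallel(h0, steps, width):
    """Sample `width` trajectories that share only the start state.

    Each trajectory depends on h0 and its own seed alone, so they could be
    dispatched concurrently; the list comprehension here is just for brevity.
    """
    return [sample_trajectory(h0, steps, seed) for seed in range(width)]


h0 = [1.0, -1.0, 0.5]
steps, width = 4, 8                        # hypothetical sizes
finals = sample_parallel(h0, steps, width)

serial_steps = steps                       # critical path length, independent of width
total_step_evals = steps * width           # total computation grows with width
```

Widening from 1 to 8 trajectories multiplies the total number of transition evaluations by 8 while the longest chain of dependent steps stays at 4.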

There's a deeper reason this works at all. Pruning studies show neural networks naturally break compositional tasks into isolated, modular subnetworks, and pretraining makes that modular structure more reliable (Do neural networks naturally learn modular compositional structure?). If the 'abstraction composing' that depth provides is really modular subroutines being chained, then a recurrent latent block that re-applies and re-composes those modules is a plausible stand-in for stacking dedicated layers. A related instinct shows up in memory architectures: Titans separates short-term attention from a compressed long-term memory module, scaling capability by adding a different kind of computation rather than more of the same layers (Can neural memory modules scale language models beyond attention limits?).

The honest summary: the corpus supports an optimistic 'yes, partly.' Latent recurrence is a credible way to capture depth's compositional payoff while sidestepping its parameter cost — but it inherits depth's serial latency unless you blend in width, and none of these notes report a clean apples-to-apples trainability comparison against a deep baseline. The interesting thing you may not have come looking for is that 'depth,' 'width,' 'latent thought,' and 'memory' aren't four separate scaling stories here — they're four levers on the same underlying compositional machinery.

Sources 5 notes

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions, which enable parallel trajectory sampling, sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can neural memory modules scale language models beyond attention limits?

The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed) and prioritizes surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
