Can width-scaling replace depth-scaling on inherently sequential problems?
This explores whether width-scaling (parallel sampling, more breadth) can stand in for depth-scaling (serial layers, sequential computation) on problems where step N truly depends on step N-1. The short version the corpus suggests: width and depth solve different problems, and on *inherently sequential* tasks, depth is doing work width structurally cannot.
Start with what depth actually buys you. For small models, deep-and-thin architectures beat balanced ones because layers *compose*: each layer builds an abstraction on top of the one below, and that stacking can't be spread across width (see "Does depth matter more than width for tiny language models?"). The hard ceiling shows up in complexity terms: a fixed-depth transformer is confined to the AC0/TC0 complexity classes, which the corpus links to chain-of-thought collapsing on Sudoku and mazes, tasks that are nothing but long chains of dependent steps. Coupling slow planning with fast computation across two timescales (effective recurrence) escapes that ceiling with a tiny 27M-parameter model (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"). The lesson: when the problem is sequential, you need *more effective depth*, and adding width doesn't grant it.
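To make "layers compose" concrete, here is a minimal Python sketch. It is a toy, not anything taken from the cited notes: `step`, `run_chain`, `wide_shallow`, and the modulus are all hypothetical, and the iterated modular map is only a stand-in for a task where step N needs the output of step N-1.

```python
MODULUS = 101  # small prime, chosen only to keep the toy numbers readable


def step(x: int) -> int:
    """One dependent step: the next state needs the current state as input."""
    return (x * x + 1) % MODULUS


def run_chain(x0: int, depth: int) -> int:
    """Apply `step` serially `depth` times, starting from x0."""
    x = x0
    for _ in range(depth):
        x = step(x)
    return x


def wide_shallow(x0: int, width: int, depth: int) -> set[int]:
    """Run `width` independent workers, each limited to `depth` serial steps.

    Every worker starts from the same input and cannot see another
    worker's output, so no worker gets past depth `depth`.
    """
    return {run_chain(x0, depth) for _ in range(width)}


target = run_chain(2, 8)                      # eight dependent steps
wide = wide_shallow(2, width=1000, depth=4)   # many workers, half the depth
stacked = run_chain(run_chain(2, 4), 4)       # two depth-4 stages in series

print(target in wide)     # False: more workers never reach step 8
print(stacked == target)  # True: feeding one stage into the next does
```

The only thing that recovers the depth-8 answer is handing one stage's output to the next, which is exactly what depth (or recurrence) provides and what independent parallel workers lack.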
The sharpest negative evidence is that LLMs don't appear to run iterative procedures in latent space: faced with an optimization that requires looping toward a solution, they pattern-match a template and emit plausible-but-wrong numbers, and the failure persists at every scale tested (see "Do large language models actually perform iterative optimization?"). That's the signature of a missing serial mechanism, not a missing-parameters problem. It rhymes with the plateau on constrained optimization, where constraint satisfaction stalls at ~55–60% regardless of parameter count or architecture: a structural ceiling, not a scaling gap (see "Do larger language models solve constrained optimization better?").
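For readers who want a picture of what "looping toward a solution" means, the sketch below runs plain gradient descent on a toy quadratic. It is an illustration under my own assumptions (the objective, learning rate, and function names are hypothetical) and is not the benchmark used in the cited note.

```python
def grad(x: float) -> float:
    """Gradient of the toy objective f(x) = (x - 3)^2, minimized at x = 3."""
    return 2.0 * (x - 3.0)


def descend(x0: float, lr: float, steps: int) -> float:
    """Run `steps` gradient-descent updates; each update reads the previous iterate."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


one_shot = descend(0.0, lr=0.1, steps=1)    # a single update, far from the minimum
iterated = descend(0.0, lr=0.1, steps=50)   # fifty dependent updates, close to 3

print(round(one_shot, 3), round(iterated, 3))
```

Each iterate is a function of the one before it, so stopping after one update yields a number that has the right form but the wrong value. That is the toy analogue of emitting a template-shaped answer without actually running the loop.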
So where *does* width win? Precisely where the work is parallelizable. Reasoning systems scale efficiently by sampling parallel latent trajectories, dodging the serial latency cost of depth-only scaling when independent paths can explore the solution space at once (see "Can reasoning systems scale wider instead of only deeper?"). And allocating test-time compute to *diverse abstractions* enforces a breadth-first search that beats deep-but-narrow chains, which otherwise 'underthink' by committing early to one line (see "Can abstractions guide exploration better than depth alone?"). Notice the pattern: width helps with *search and exploration* (which candidate strategy?), an embarrassingly parallel question, but the moment you've picked a strategy and have to *execute* its dependent steps, you're back in serial territory.
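The search half of that pattern can be sketched as independent workers that each check their own slice of a candidate space against a cheap verifier. This is a generic illustration, not the GRAM or RLAD method; `verify`, `scan`, and `wide_search` are hypothetical names, and the verifier is a stand-in for "does this candidate strategy work?".

```python
from concurrent.futures import ThreadPoolExecutor


def verify(candidate: int) -> bool:
    """Cheap check on a single candidate; no candidate depends on another."""
    return candidate * candidate == 1849


def scan(chunk: range) -> list[int]:
    """One worker: test every candidate in its own slice."""
    return [c for c in chunk if verify(c)]


def wide_search(limit: int, width: int) -> list[int]:
    """Split [0, limit) into `width` slices and scan them independently."""
    size = -(-limit // width)  # ceiling division
    chunks = [range(start, min(start + size, limit))
              for start in range(0, limit, size)]
    with ThreadPoolExecutor(max_workers=width) as pool:
        results = list(pool.map(scan, chunks))
    return [hit for found in results for hit in found]


print(wide_search(100, width=4))  # [43]
```

Because no slice needs another slice's result, adding workers shortens the search without changing the answer. In CPython, threads will not speed up CPU-bound work like this, so treat the executor as a picture of the structure rather than a performance recipe.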
The thing you might not have known you wanted to know: depth isn't really about layer count, it's about *composability under dependency*. Recursive subtask trees that internalize their own structure can sustain reasoning past context limits by pruning the cache between dependent stages (see "Can recursive subtask trees overcome context window limits?"), and the long-context bottleneck turns out to be the *compute to consolidate* prior context into state, not raw memory (see "Is long-context bottleneck really about memory or compute?"). Both reframe 'sequential' as 'needs serial transformation steps.' Width gives you more guesses; depth gives you the ability to build on your last answer. On inherently sequential problems, more guesses don't add up to a chain, so width complements depth but doesn't replace it.
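A back-of-the-envelope way to see why "more guesses don't add up to a chain": suppose, purely as a simplifying assumption of mine and not a result from the corpus, that a sampler with no serial mechanism must guess every one of T dependent steps correctly, each independently with probability q. The function names below are hypothetical.

```python
import math


def chain_success(q: float, steps: int) -> float:
    """Probability one sample gets all `steps` dependent steps right."""
    return q ** steps


def samples_needed(q: float, steps: int, confidence: float = 0.5) -> int:
    """Independent samples needed so at least one full chain is right
    with probability >= `confidence`, from 1 - (1 - p)^N >= confidence."""
    p = chain_success(q, steps)
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p))


for steps in (10, 50, 100):
    print(steps, samples_needed(0.9, steps))
```

Under this toy model the required width grows roughly exponentially with chain length, while a mechanism that can check and build on each step pays only a linear number of serial steps. Real models violate the independence assumption, so read the numbers as a shape, not a prediction.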
Sources (8 notes)
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.
Research shows the long-context bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.