Can reasoning systems scale wider instead of only deeper?
Explores whether sampling multiple parallel latent trajectories offers a faster scaling path than recursive refinement alone. Matters because it could unlock latency-efficient reasoning at test time.
Recursive Reasoning Models (RRMs) increase reasoning capability by iterating a shared transition function over a latent state — more iterations means more "thinking" without extending the output sequence. This is depth scaling, and it decouples reasoning depth from both parameter count and output length. GRAM (Generative Recursive reAsoning Models) argues this is only half the story: depth alone is insufficient because a single refinement path can become trapped in a suboptimal trajectory, and many problems have ambiguity or multiple valid solutions that a single converging path cannot represent.
The structural claim is that future recursive reasoners should be not only deep (repeated refinement) but also wide (maintaining and exploring multiple latent trajectories in parallel). GRAM operationalizes width by making the latent transition stochastic and sampling several trajectories simultaneously. Crucially, width sidesteps the latency penalty that depth-only scaling incurs: N sampled trajectories can run in parallel, whereas N additional refinement steps are serial and accumulate wall-clock time.
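The contrast between one deterministic refinement path and several stochastic ones can be sketched on a toy problem. This is an illustration only, not GRAM's architecture: the scalar latent `z`, the double-well `energy`, the noise scale `sigma`, the noiseless `polish` steps, and the pick-lowest-energy selector are all assumptions made for the example. The width loop is written sequentially for clarity; the rollouts are independent, so in practice they could be batched on parallel hardware.

```python
import random

def energy(z):
    """Toy objective: local minimum near z=+1, global minimum near z=-1."""
    w = z * z - 1.0
    return w * w + 0.3 * z

def transition(z, sigma, rng, step=0.05):
    """One shared refinement step: a gradient move plus optional Gaussian noise."""
    grad = 4.0 * z * (z * z - 1.0) + 0.3
    noise = sigma * rng.gauss(0.0, 1.0) if sigma else 0.0
    return z - step * grad + noise

def rollout(z0, depth, sigma, rng, polish=20):
    """Iterate the transition `depth` times, then settle without noise."""
    z = z0
    for _ in range(depth):
        z = transition(z, sigma, rng)
    for _ in range(polish):  # noiseless steps so final energies are comparable
        z = transition(z, 0.0, rng)
    return z

def wide_rollout(z0, depth, width, sigma, seed=0):
    """Sample `width` independent stochastic trajectories and keep the best."""
    rng = random.Random(seed)
    finals = [rollout(z0, depth, sigma, rng) for _ in range(width)]
    return min(finals, key=energy), finals

if __name__ == "__main__":
    # Depth only: many deterministic steps from a start inside the wrong basin.
    deep = rollout(1.5, depth=400, sigma=0.0, rng=random.Random(0))
    # Width: far fewer serial steps, many stochastic trajectories.
    best, _ = wide_rollout(1.5, depth=40, width=64, sigma=0.3, seed=0)
    print(f"deep-only: z={deep:.2f}, energy={energy(deep):.2f}")
    print(f"wide:      z={best:.2f}, energy={energy(best):.2f}")
```

In this toy, extra depth does not help the deterministic path because it has already converged to the local minimum, while noise lets some of the parallel trajectories cross into the better basin. The serial cost of the wide run is `depth + polish` steps regardless of `width`.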
This reframes the inference-scaling design space for latent architectures. It mirrors at the latent-state level what parallel-versus-sequential debates established at the token level: as argued in "Why does parallel reasoning outperform single chain thinking?", breadth often beats depth under a fixed budget because independent paths sample the solution distribution rather than inflating variance along one path. GRAM brings that lesson inside the recurrent block, where prior work such as that covered in "Can models reason without generating visible thinking tokens?" had only scaled depth. The counterpoint to watch: as "Can parallel architectures solve inherently sequential problems?" notes, width cannot substitute for depth on inherently serial problems, so the two axes are complements, not interchangeable knobs. Why it matters: it gives latent reasoning a second, latency-cheap scaling dimension and offers an explanation for why deterministic RRMs underperform on multi-solution tasks.
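A back-of-envelope version of the fixed-budget argument, offered as an illustration rather than a result from the GRAM paper. It assumes that each trajectory independently reaches a correct solution with probability $p$, that a selector can recognise a correct trajectory when one exists, and that all $N$ trajectories run concurrently at per-step cost $t_{\text{step}}$ over $T$ refinement steps:

```latex
\begin{align*}
P_{\text{wide}}(N) &= 1 - (1 - p)^{N}, \\
\text{wall-clock}_{\text{wide}} &\approx T \, t_{\text{step}}, \qquad
\text{wall-clock}_{\text{deep}} \approx N \, T \, t_{\text{step}}
\quad \text{(same total compute } N T \text{ steps)}.
\end{align*}
```

For example, with $p = 0.2$ and $N = 8$, $P_{\text{wide}} = 1 - 0.8^{8} \approx 0.83$. The independence assumption is exactly what fails on inherently serial problems, where no single short trajectory can succeed and $p$ is effectively zero.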
— "Generative Recursive Reasoning", https://arxiv.org/abs/2605.19376
Related concepts in this collection
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  the token-level analogue: breadth beats depth under fixed budget because independent paths sample the distribution
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  the depth-only RRM baseline GRAM extends with a width axis
- Can parallel architectures solve inherently sequential problems?
  Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
  the limit: width cannot replace depth on inherently sequential problems
- Can we explore multiple reasoning paths without committing to one token?
  Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
  a related parallel-exploration mechanism, but in concept-token space rather than recurrent latent space
Original note title
reasoning systems should scale in width by sampling parallel latent trajectories not only in depth