
Can parallel architectures solve fundamentally sequential problems?

Explores whether pure parallel computation—like Transformers—can tackle problems requiring long chains of dependent reasoning, or if serial depth is theoretically necessary for certain classes of problems.

Note · 2026-02-23 · sourced from Novel Architectures

The distinction between problems that are "wide" (parallelizable) and "deep" (inherently serial) is fundamental to computational complexity theory but underappreciated in machine learning. The Serial Scaling Hypothesis formalizes this: for many important ML problems — especially those involving complex reasoning, planning, or evolution of interacting systems — increasing parallel computation alone is insufficient. Progress requires scaling serial computation.
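To make the split concrete, here is a minimal sketch (the function names and the logistic-map example are illustrative, not from the source):

```python
import math

# "Wide": summing N numbers. Addition is associative, so the work
# splits across processors in a balanced tree: O(log N) parallel
# depth no matter how large N gets.
def parallel_depth_of_sum(n: int) -> int:
    return math.ceil(math.log2(n))

# "Deep": iterating a non-associative map x_{t+1} = f(x_t). Step t
# needs step t-1's result, so the dependency chain forces O(N)
# serial depth; extra processors do not help.
def iterate(f, x0, steps: int):
    x = x0
    for _ in range(steps):
        x = f(x)
    return x

print(parallel_depth_of_sum(1_000_000))                # ~20 parallel layers
print(iterate(lambda x: 3.9 * x * (1 - x), 0.5, 100))  # 100 forced serial steps
```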

The complexity-theoretic argument is precise. Transformers and state-space models, despite processing "sequences," are parallelizable computations: absent autoregressive decoding (generating one token at a time), a Transformer forward pass ingests all N tokens concurrently, and its serial depth is fixed by the layer count, not by the input length. This places these architectures in shallow circuit classes such as AC0 or TC0, which, under standard complexity assumptions, cannot express inherently serial problems (e.g., P-complete problems), however wide the model grows.
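The parallelism is visible in the computation itself. Below is a minimal single-head self-attention layer in NumPy (a sketch written for this note, with masking and multi-head details omitted); nothing in it loops over the sequence:

```python
import numpy as np

def attention_layer(X, Wq, Wk, Wv):
    """One self-attention layer: all N tokens are processed at once.

    Every matmul below touches all token pairs simultaneously, so the
    serial depth of this pass is a small constant, independent of the
    sequence length N. Stacking L layers gives serial depth O(L),
    still constant in N -- the circuit-complexity sense in which
    Transformers are "shallow".
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # parallel over all N tokens
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # all N^2 pairs at once
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax, row-parallel
    return weights @ V                         # parallel mix of values

N, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = attention_layer(X, Wq, Wk, Wv)  # one fixed-depth pass, regardless of N
```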

The most striking finding: diffusion models cannot solve inherently serial problems. Despite appearing serial due to iterative denoising, a diffusion model with a fixed-depth TC0 backbone remains in TC0 even as the number of denoising steps grows without bound. The iteration is a red herring; what matters is the computational depth per step, and that is bounded.
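Schematically, the structure under discussion looks like the sketch below (the four-layer backbone and the step count are placeholders, not from the paper). The claim is that growing `steps` in the outer loop does not move the sampler out of TC0, because every pass has the same bounded depth:

```python
import numpy as np

def fixed_depth_denoiser(x, params):
    # A backbone with a constant number of layers (here 4): a
    # fixed-depth, TC0-style circuit whose serial depth does not
    # grow with problem size.
    for W in params:  # always 4 layers
        x = np.tanh(x @ W)
    return x

def diffusion_sample(x_T, params, steps=1000):
    # Iterative denoising *looks* serial: step t consumes step t+1.
    # Per the finding described above, this loop does not buy serial
    # expressive power: with a fixed-depth backbone, the sampler
    # stays in TC0 no matter how large `steps` is.
    x = x_T
    for _ in range(steps):
        x = fixed_depth_denoiser(x, params)
    return x
```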

Consider Sudoku as a parable. Easy puzzles can be solved by filling blanks independently — parallel. Hard puzzles require long chains of dependent reasoning where each blank depends on others — serial. No algorithm can shortcut this dependency chain. The same distinction applies to mathematical reasoning, physical simulation, and sequential decision-making.
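A compact solver sketch (written for this note, validity checks omitted) makes the two regimes explicit: repeated `parallel_round` passes crack easy puzzles, while hard ones fall through to the serial guess-and-backtrack chain in `solve`:

```python
from copy import deepcopy

def candidates(grid, r, c):
    """Digits legal at (r, c) given current row/column/box contents."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]

def parallel_round(grid):
    """The 'wide' part: every cell with exactly one candidate can be
    filled independently of the others, so one round is embarrassingly
    parallel. Easy puzzles fall to repeated rounds alone."""
    fills = [(r, c, cs[0])
             for r in range(9) for c in range(9)
             if grid[r][c] == 0 and len(cs := candidates(grid, r, c)) == 1]
    for r, c, d in fills:
        grid[r][c] = d
    return bool(fills)

def solve(grid):
    """The 'deep' part: when no cell is forced, guess and recurse.
    Each guess depends on the outcome of earlier guesses -- a serial
    dependency chain that parallel rounds cannot shortcut."""
    while parallel_round(grid):
        pass
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in candidates(grid, r, c):
                    trial = deepcopy(grid)
                    trial[r][c] = d
                    if (solved := solve(trial)) is not None:
                        return solved
                return None  # dead end: backtrack
    return grid  # no empty cells left
```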

The implications cut three ways:

  1. Model design: Future architectures need recurrent structures to increase serial computation, not just wider parallel ones. But adding recurrence introduces trainability concerns — gradient variance increases and Lipschitz constants grow, making training harder.
  2. Hardware: The field may need faster, lower-latency processors, not just more parallel ones.
  3. Evaluation: Serial compute should be measured and reported separately from total compute (a toy accounting sketch follows this list).
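As a toy version of that accounting (all names and numbers below are hypothetical), note how two configurations can match on total compute while differing by orders of magnitude in serial depth:

```python
def compute_profile(layers: int, width: int, seq_len: int,
                    decode_steps: int) -> dict:
    """Toy accounting for a decoder-only Transformer (constants dropped).

    total:  rough FLOP count -- what scaling reports usually quote.
    serial: length of the longest dependency chain -- layers per
            forward pass times the number of autoregressive steps.
    """
    per_step_flops = layers * seq_len * width ** 2
    return {
        "total": decode_steps * per_step_flops,
        "serial": decode_steps * layers,
    }

# Identical totals, 256x difference in serial depth:
wide = compute_profile(layers=24, width=8192, seq_len=1024, decode_steps=1)
deep = compute_profile(layers=24, width=512, seq_len=1024, decode_steps=256)
assert wide["total"] == deep["total"]   # same reported compute budget
print(wide["serial"], deep["serial"])   # 24 vs 6144 serial layers
```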

This provides the complexity-theoretic foundation for why "Can models reason without generating visible thinking tokens?" matters: recurrent depth-scaling is not just an efficiency trick but a necessity for inherently serial problems that parallel architectures provably cannot solve.
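What "scaling serial computation" can look like architecturally, as a minimal sketch (weight-tied recurrence; the class and its parameters are illustrative, not a specific published model):

```python
import numpy as np

class TiedRecurrentBlock:
    """One weight matrix applied k times: serial depth grows with k
    while the parameter count stays fixed."""

    def __init__(self, d: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))

    def forward(self, x: np.ndarray, k: int) -> np.ndarray:
        # k is a test-time knob: more iterations buy more serial
        # compute on the same weights. Step i consumes step i-1, so
        # these iterations cannot be parallelized away.
        for _ in range(k):
            x = np.tanh(x @ self.W) + x  # residual keeps iterates stable
        return x

block = TiedRecurrentBlock(d=64)
x = np.random.default_rng(1).normal(size=64)
shallow = block.forward(x, k=4)   # 4 units of serial depth
deep = block.forward(x, k=64)     # 16x the serial depth, same parameters
```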


Source: Novel Architectures
