The Serial Scaling Hypothesis
While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These "inherently serial" problems—from mathematical reasoning to physical simulations to sequential decision-making—require dependent computational steps that cannot be parallelized. Drawing on complexity theory, we formalize this distinction and demonstrate that current parallel-centric architectures face fundamental limitations on such tasks. We argue that recognizing the serial nature of computation has profound implications for machine learning, model design, and hardware development. As AI tackles increasingly complex reasoning, deliberately scaling serial computation—not just parallel computation—is essential for continued progress.
Consider Sudoku—a number-placement puzzle requiring each number 1–9 to appear once per row, column, and subgrid—as a parable (Figure 1). Easy puzzles can be solved by filling in many blanks independently, in parallel. Hard puzzles, however, require a long chain of dependent reasoning: each blank depends on the others. No algorithm can shortcut the process. This distinction—between problems that are “wide” (parallel) and those that are “deep” (inherently serial)—is fundamental; it dictates how we design efficient models, yet it is underappreciated in machine learning. We propose the Serial Scaling Hypothesis: For many important ML problems, especially those involving complex reasoning, planning, or the evolution of interacting systems, increasing parallel computation alone is insufficient. Progress requires scaling the amount of serial computation.
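The wide/deep distinction can be made concrete with a toy sketch (our own illustration, not the paper's code): hashing many blocks is a "wide" computation, while a hash chain is a standard example of a "deep" one with no known shortcut past its dependent steps.

```python
import hashlib

def wide_problem(blocks):
    """'Wide': each output depends only on its own input, so all
    blocks could be hashed by independent workers simultaneously."""
    return [hashlib.sha256(b).hexdigest() for b in blocks]

def deep_problem(seed, steps):
    """'Deep': a hash chain. Step t needs the output of step t-1,
    and no known method jumps ahead without doing every step."""
    x = seed
    for _ in range(steps):
        x = hashlib.sha256(x).digest()
    return x.hex()

print(wide_problem([b"cell-1", b"cell-2"]))  # order of evaluation is irrelevant
print(deep_problem(b"puzzle", 10_000))       # serial depth: 10,000 dependent steps
```

Hash chains of this kind underpin proofs of sequential work precisely because their depth cannot be traded for width: a million processors hash a chain no faster than one.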
Why does this matter for machine learning? As we push toward more challenging tasks—advanced reasoning, physical simulations, planning, and scientific discovery—we encounter problems that parallel architectures (like Transformers and diffusion models) cannot efficiently solve. Recognizing the limits of parallel scaling and the necessity of serial computation has several implications:
• Model design: Should we revisit architectures that allow for deeper or more sequential computation?
• Hardware: Is it time to invest in faster, lower-latency processors, not just more parallel ones?
• Evaluation: Should we measure and report serial compute separately from total compute?
Contributions. We (i) formulate the Serial Scaling Hypothesis and connect it to complexity theory, (ii) survey a range of problems—cellular automata, many-body physics, sequential decision-making, and advanced reasoning—where serial computation is provably or empirically essential, (iii) analyze the limitations of modern parallel-centric machine learning models on these tasks, and (iv) discuss implications for future model and hardware design, with practical recommendations.
Transformers and state-space models (SSMs) [36, 35] are themselves "sequence" models, yet their computation is parallelizable. Unless run with autoregressive inference, which limits throughput to one token at a time, a Transformer or SSM layer can ingest all N input tokens concurrently and produce its outputs in parallel.
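Why SSM computation parallelizes can be seen in a small sketch (our own illustration, with assumed scalar states): the linear recurrence h_t = a_t h_{t-1} + b_t at the core of an SSM layer is a composition of affine maps, and affine maps compose associatively, so the chain can be reduced as a balanced tree in O(log N) parallel depth instead of N serial steps.

```python
def sequential(a, b, h0=0.0):
    """N dependent steps of the linear recurrence h_t = a_t * h_{t-1} + b_t."""
    h, out = h0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out

def compose(f, g):
    """Compose affine maps (apply f, then g): x -> g_a * (f_a * x + f_b) + g_b."""
    return (f[0] * g[0], g[0] * f[1] + g[1])

def parallel_last(a, b, h0=0.0):
    """Tree-reduce the affine maps; each round below is fully parallel,
    so on parallel hardware the depth is O(log N), not O(N)."""
    maps = list(zip(a, b))
    while len(maps) > 1:
        nxt = [compose(maps[i], maps[i + 1]) for i in range(0, len(maps) - 1, 2)]
        if len(maps) % 2:          # carry an unpaired trailing map forward
            nxt.append(maps[-1])
        maps = nxt
    A, B = maps[0]
    return A * h0 + B
```

A nonlinear recurrence such as h_t = tanh(W h_{t-1} + b_t) admits no such associative decomposition, which is exactly why classical RNNs are serial while SSMs are not.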
Diffusion models cannot solve inherently serial problems. A surprising finding, first proven here, is that diffusion models, which appear non-parallelizable due to their iterative nature, are incapable of solving inherently serial problems. We demonstrate theoretically in Section 5 that diffusion models [42, 107, 106] with a fixed-depth TC0 backbone network still lie in TC0, i.e., they cannot solve inherently serial problems even with infinitely many diffusion steps. We summarize models' theoretical capabilities in Table 2.
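One way to build intuition for why iterative sampling need not be serial (an illustration of ours, separate from the Section 5 proof): the whole sampling trajectory is the fixed point of a map that Picard iteration can refine with sweeps that are parallel across timesteps, so many sequential-looking steps collapse into a few parallel rounds when the per-step update is smooth. The toy drift `f` below stands in for a fixed-depth denoiser.

```python
import numpy as np

def f(x, t):
    """Toy smooth per-step update standing in for a fixed-depth network."""
    return 0.1 * np.cos(x + 0.3 * t)

def sequential_sampler(x0, T):
    """T dependent steps, as in naive iterative sampling."""
    xs = [x0]
    for t in range(T):
        xs.append(xs[-1] + f(xs[-1], t))
    return np.array(xs)

def picard_sampler(x0, T, sweeps):
    """Refine the entire trajectory at once: each sweep evaluates f at
    every timestep in parallel, so serial depth is ~sweeps, not T."""
    X = np.full(T + 1, x0, dtype=float)
    for _ in range(sweeps):
        drift = f(X[:-1], np.arange(T))  # all timesteps simultaneously
        X = np.concatenate(([x0], x0 + np.cumsum(drift)))
    return X
```

For smooth drifts the Picard sweeps converge to the sequential trajectory far faster than one sweep per step, which is the intuition behind the fixed-depth collapse: iteration count alone does not buy serial depth.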
For model designers. To solve these challenging real-world problems efficiently, future models may need to incorporate recurrent structures that increase serial computation, in addition to today's predominantly parallel designs. Note that such designs may come with trainability concerns: adding recurrence or depth often increases gradient variance [8, 86] and the Lipschitz constant of the model [7, 30], which together make the model harder to train and call for improved training techniques.
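As a minimal sketch of what such a recurrent structure could look like (toy sizes and names of our own, not a design proposed in the paper): a weight-tied block whose serial depth is a knob T, decoupled from parameter count.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 16))  # one shared weight matrix, reused every step

def recurrent_block(x, T):
    """Apply the same nonlinear update T times: parameters stay fixed
    while serial compute (depth) scales with T."""
    h = x
    for _ in range(T):
        h = np.tanh(h @ W + x)  # input injection keeps the iteration anchored to x
    return h

x = rng.normal(size=(16,))
shallow = recurrent_block(x, T=2)   # little serial compute
deep = recurrent_block(x, T=64)     # 32x the serial compute, same parameter count
```

Backpropagating through large T is exactly where the gradient-variance and Lipschitz issues cited above bite, so scaling this knob presupposes the training-technique improvements discussed here.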