How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
Every approach to test-time compute lands somewhere on the parallel-sequential axis:
- Parallel: Sample multiple responses independently, aggregate (Best-of-N, majority voting, mixture of agents). Improves coverage — the chance of including the right answer in the candidate set.
- Sequential: Extend or refine a single chain iteratively (chain-of-thought, self-revision, step-by-step refinement). Allows depth — exploring one promising line of reasoning fully.
- Hybrid: Use parallel sampling at key decision points and sequential reasoning within branches (Tree of Thoughts, beam search, MCTS). Tries to balance exploration and exploitation.
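The three strategies can be sketched with caller-supplied sampler callables standing in for LLM calls (`best_of_n`, `sequential_refine`, and `hybrid_tree` are illustrative names, not any paper's API):

```python
from collections import Counter

def best_of_n(sample, n):
    """Parallel: draw n independent answers, keep the majority answer.
    More samples improve coverage; each attempt stays short."""
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def sequential_refine(refine, steps, answer=""):
    """Sequential: one chain; each step conditions on the previous
    answer. More steps buy depth, but only one line is explored."""
    for _ in range(steps):
        answer = refine(answer)
    return answer

def hybrid_tree(refine, branch, depth, score):
    """Hybrid sketch (beam width 1): branch in parallel at each level,
    keep the best-scoring candidate, then extend it sequentially."""
    answer = ""
    for _ in range(depth):
        answer = max((refine(answer) for _ in range(branch)), key=score)
    return answer
```

Under a fixed budget of model calls, the three differ only in how calls are arranged: `best_of_n(s, 16)` spends all sixteen on breadth, `sequential_refine(r, 16)` spends them on one deep chain, and `hybrid_tree(r, 4, 4, score)` splits the budget between branching and depth.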
The pattern recurs consistently across papers, architectures, and tasks. The trade-off between coverage and depth is not a special feature of any one method — it's a fundamental tension in how to allocate finite compute.
Empirical evidence increasingly favors parallel approaches on general benchmarks (see Why does parallel reasoning outperform single chain thinking?), but the field's intuition still leans sequential because it maps onto human reasoning patterns. The disconnect between what works and what feels right is part of what makes the overthinking findings surprising.
The exponential counter-case: On structured compositional problems where solutions require sequential accumulation of intermediate results (graph connectivity, deep multi-hop chains), sequential CoT is exponentially better than parallel voting (see When does sequential reasoning beat parallel voting?). This resolves the apparent contradiction: parallel wins when independent short attempts can each reach an answer; sequential wins when the problem requires depth that short chains cannot achieve at all. Task structure is the moderating variable.
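A toy probability model makes the gap concrete. Assume (strongly) that a chain shorter than the task's dependency depth can never reach the answer, and that each dependent step succeeds with probability p:

```python
def chain_success(p_step, chain_len, depth_needed):
    """P(one chain solves a task needing depth_needed dependent steps):
    zero if the chain is too short, else p_step ** depth_needed."""
    return p_step ** depth_needed if chain_len >= depth_needed else 0.0

def parallel_success(p_step, chain_len, depth_needed, n_chains):
    """P(at least one of n_chains independent chains succeeds)."""
    p = chain_success(p_step, chain_len, depth_needed)
    return 1.0 - (1.0 - p) ** n_chains

# A task needing 20 dependent steps:
# 1000 parallel chains of length 10 never succeed; coverage cannot buy depth.
print(parallel_success(0.9, 10, 20, 1000))  # 0.0
# One chain of length 20 succeeds with probability 0.9**20 (~0.12).
print(chain_success(0.9, 20, 20))
```

Under this assumption no amount of parallel sampling of too-short chains helps, while a single sufficiently deep chain has nonzero success; conversely, once chains are deep enough, parallelism amplifies that nonzero probability, matching the task-structure framing above.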
Training format as an upstream determinant: Does training data format shape reasoning strategy more than domain? shows that multiple-choice training produces BFS-like (parallel-resembling) reasoning; free-form training produces DFS-like (sequential) reasoning. The parallel/sequential trade-off plays out at training time too — format determines which pole a model's default reasoning strategy occupies before any inference-time decisions are made.
Retrieval-level parallel/sequential trade-off: RAG-R1 demonstrates the parallel/sequential dichotomy at the retrieval level. Single-query mode requires sequential multi-turn retrieval rounds; multi-query parallelism issues multiple queries simultaneously, reducing retrieval rounds and improving information diversity. The same structural trade-off — coverage (parallel) vs depth (sequential) — appears in RAG system design, not just reasoning token allocation.
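The two retrieval modes can be sketched with hypothetical `search`, `reformulate`, and `decompose` callables (illustrative stand-ins, not RAG-R1's actual interfaces):

```python
def sequential_retrieval(question, search, reformulate, rounds):
    """Single-query mode: one query per round, each conditioned on the
    evidence gathered so far. Depth, at the cost of serial latency."""
    evidence, query = [], question
    for _ in range(rounds):
        evidence.extend(search(query))
        query = reformulate(question, evidence)  # next query sees evidence
    return evidence

def parallel_retrieval(question, search, decompose):
    """Multi-query mode: decompose the question into sub-queries and
    issue them in one round. Diversity, at the cost of conditioning."""
    evidence = []
    for query in decompose(question):
        evidence.extend(search(query))
    return evidence
```

The structural trade-off is visible in the signatures: the sequential mode can condition each query on prior evidence but needs `rounds` serial calls, while the parallel mode finishes in one round but must commit to all sub-queries up front.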
Complexity-theoretic foundation — the Serial Scaling Hypothesis: Can parallel architectures solve fundamentally sequential problems? provides the theoretical grounding: inherently serial problems (mathematical reasoning, physical simulation, planning) cannot be solved by parallel architectures. Transformers and even diffusion models are in TC0 — provably incapable of solving inherently serial problems regardless of compute. This reframes the trade-off: it's not just empirical (which works better) but formal (some problems require serial computation). The parallel-wins finding applies to parallelizable problems; the serial hypothesis identifies problems where parallel is provably insufficient.
Evolutionary inference as a third mode: Mind Evolution introduces population-based search at inference time, neither pure parallel sampling nor sequential refinement but iterative evolution of diverse candidate populations (see Can evolutionary search beat sampling and revision at inference time?). The island model sustains diversity that single-trajectory refinement loses, while genetic recombination creates candidates that independent sampling cannot reach. This suggests the parallel/sequential axis may be insufficient: population-based methods occupy a distinct region of the design space.
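As a minimal sketch of the island model under toy assumptions: `fitness`, `crossover`, and `mutate` are caller-supplied callables here, whereas Mind Evolution uses an LLM to recombine and revise candidates and a programmatic evaluator as the fitness signal.

```python
import random

def island_evolve(population, fitness, crossover, mutate, generations,
                  rng, n_islands=4, migrate_every=3):
    """Island-model evolutionary search: each island evolves its own
    subpopulation; periodic migration of champions sustains the global
    diversity that single-trajectory refinement loses."""
    k = len(population) // n_islands
    islands = [population[i * k:(i + 1) * k] for i in range(n_islands)]
    for gen in range(1, generations + 1):
        for i, isl in enumerate(islands):
            # Selection: keep the fitter half of the island as parents.
            parents = sorted(isl, key=fitness, reverse=True)[:max(2, k // 2)]
            # Recombination + mutation produce the next subpopulation.
            islands[i] = [mutate(crossover(*rng.sample(parents, 2)))
                          for _ in range(k)]
        if gen % migrate_every == 0:
            # Each island's champion migrates to the neighboring island.
            champs = [max(isl, key=fitness) for isl in islands]
            for i, champ in enumerate(champs):
                islands[(i + 1) % n_islands].append(champ)
    return max((c for isl in islands for c in isl), key=fitness)
```

Unlike Best-of-N, the candidates here are not independent (recombination mixes them), and unlike single-chain refinement, many lineages survive in parallel, which is what places population-based search off the parallel-sequential axis.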
Source: Test Time Compute, Novel Architectures
Related concepts in this collection
- Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth. (Relation: empirical resolution of this trade-off.)
- How should we categorize different test-time scaling approaches? Test-time scaling research spans multiple strategies for improving model performance at inference. Understanding how these approaches differ, and how they relate, helps researchers and practitioners choose the right method for their constraints. (Relation: a related but distinct dichotomy.)
- Does voting discard useful reasoning from losing chains? When multiple reasoning chains compete through majority voting, intermediate steps from non-winning chains are discarded. Could extracting and mixing those intermediate facts improve both the final answer and our ability to understand the reasoning? (Relation: refines the parallel endpoint; after parallel sampling, voting is the wrong aggregation, and meta-reasoning over intermediates extracts more value from parallel chains.)
- Can parallel architectures solve fundamentally sequential problems? Explores whether pure parallel computation, like Transformers, can tackle problems requiring long chains of dependent reasoning, or whether serial depth is theoretically necessary for certain classes of problems. (Relation: complexity-theoretic foundation; some problems provably require serial computation.)
- Can evolutionary search beat sampling and revision at inference time? Can LLMs evolve populations of solutions through recombination and selection to outperform simpler inference strategies? This matters because it could reveal whether biologically inspired search improves planning without formal problem definitions. (Relation: third mode; population-based evolution transcends the parallel/sequential dichotomy.)
- Does planning backward help when goals have bottlenecks? Can language models exploit structural asymmetries in planning problems by reversing the search direction? This matters because most planning research assumes forward-only generation, potentially missing efficiency gains when bottlenecks constrain early possibilities. (Relation: a fourth diversity dimension beyond parallel/sequential/evolutionary; directional parallelism generates diverse candidates by planning both forward and backward, exploiting problem-specific asymmetries where bottlenecks near the goal make backward search easier.)
- Does network depth unlock qualitatively new behaviors in RL? Can scaling neural network depth from shallow (2-5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL. (Relation: a third scaling axis; depth-scaling produces qualitative capability jumps, walking at 16 layers and wall-climbing at 256, that neither parallel breadth nor sequential extension can achieve; depth may be an independent dimension alongside the parallel-vs-sequential trade-off.)
- Can retrieval be scaled like reasoning at test time? Standard RAG retrieves once, but multi-hop tasks need adaptive retrieval. Can we train models to plan retrieval chains and vary their length at test time to improve accuracy, the way test-time scaling works for reasoning? (Relation: CoRAG instantiates the parallel/sequential trade-off at the retrieval level: parallel chains (best-of-N) vs. sequential decoding vs. tree search; the same coverage-vs-depth tension recurs in retrieval just as it does in reasoning.)
Original note title: parallel vs sequential scaling is the recurring trade-off in test-time compute