Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

Paper · arXiv 2505.21825 · Published May 27, 2025
Tags: Reasoning Methods · CoT · ToT · Inference-time scaling

Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite the impressive performance it yields, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (2) (e.g., longer chains of thought) or parallel scaling (1) (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems over challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of language models, including models trained from scratch for graph connectivity with different chain-of-thought strategies, as well as large reasoning models.

(1) Parallel scaling refers to generating multiple independent responses in parallel, and aggregating them in some way to output the final solution [BJE+24, WWS+23]. The most common aggregation technique is “best-of-n”, where a reward function (e.g. another language model or a dedicated verifier [BJE+24]) selects the single highest-scoring response as the output. Another widely used aggregation method is majority voting, which determines the final response by choosing the most frequent one among all generated responses [BJE+24].
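The two aggregation techniques above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `responses` stands in for independently sampled final answers, and `reward_fn` for an external verifier's score.

```python
from collections import Counter

def majority_vote(responses):
    """Aggregate parallel samples by returning the most frequent answer."""
    return Counter(responses).most_common(1)[0][0]

def best_of_n(responses, reward_fn):
    """Aggregate parallel samples by returning the highest-scoring answer
    under a reward function (e.g. a verifier or another language model)."""
    return max(responses, key=reward_fn)

# Toy usage: five independent short answers to the same question.
samples = ["yes", "no", "yes", "yes", "no"]
print(majority_vote(samples))                 # -> yes
print(best_of_n(samples, lambda r: len(r)))   # -> yes (toy reward: length)
```

Note that both aggregators only see final answers; no information flows between the parallel chains, which is exactly what distinguishes parallel from sequential scaling.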

(2) Sequential scaling encompasses all techniques that do not fall under parallel scaling. The flagship method in this category is Chain-of-Thought (CoT) [WWS+22, NAGA+22, KGR+23], in which an LLM first outputs a chain of reasoning tokens, before outputting its final answer. This may be achieved with one of several strategies to induce longer chains of reasoning in LLMs, such as adding a prompt instruction to “think step by step” [KGR+23], or forcing a longer chain of thought by replacing end-of-text tokens with “Wait” [MYS+25], or training with reinforcement learning objectives which can automatically induce longer chains of thought [GYZ+25].
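The "Wait" strategy of [MYS+25] can be sketched as a decoding loop. This is a hypothetical stand-alone sketch: `model_step` is a stand-in for a real decoder that maps the text so far to its next token (or an end-of-text marker), and `min_tokens` is an assumed reasoning budget.

```python
def generate_with_wait(model_step, prompt, min_tokens):
    """Force a longer chain of thought: whenever the model emits
    end-of-text before the minimum budget, replace it with 'Wait'
    and let generation continue."""
    text, n = prompt, 0
    while True:
        tok = model_step(text)
        if tok == "<eos>":
            if n >= min_tokens:
                return text  # budget met, allow the model to stop
            tok = "Wait"     # forced continuation instead of stopping
        text += " " + tok
        n += 1

# Toy "model" that emits a fixed script, trying to stop after every step.
script = iter(["step1", "<eos>", "step2", "<eos>", "answer", "<eos>", "<eos>"])
out = generate_with_wait(lambda t: next(script), "Q:", min_tokens=5)
print(out)  # -> Q: step1 Wait step2 Wait answer
```

The point of the sketch is only the control flow: sequential scaling spends its budget on one chain that can build on its own earlier tokens, rather than on many independent chains.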

Reasoning task. Our reasoning task is a variant of the basic graph connectivity task: given a graph, decide whether a pair of vertices is connected. Solving it requires stepping through several edges, so it serves as a proxy task modeling multi-step reasoning with CoT on more naturalistic data. Details are in Section 2.
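The underlying decision problem can be solved classically with a breadth-first search, where each dequeued vertex plays the role of one reasoning step. A minimal sketch (the edge-list encoding here is an illustrative assumption, not the paper's input format):

```python
from collections import deque

def connected(edges, s, t):
    """Decide s-t connectivity by BFS, stepping through edges one at a time."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen, frontier = {s}, deque([s])
    while frontier:
        u = frontier.popleft()  # one "reasoning step" of the chain
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return False

# A path 0-1-2-3 plus a separate component {4, 5}.
edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
print(connected(edges, 0, 3))  # -> True: reachable via three edge steps
print(connected(edges, 0, 5))  # -> False: different components
```

The sequential flavor of the task is visible here: reaching a distant target requires chaining many edge steps, each of which depends on the vertices discovered so far.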

Our task is motivated by a growing theoretical literature on the capabilities and limitations of transformers on graph reasoning tasks [XLZ+20, SFH+24, ABL+24, BBK+24, KWLS25, SHT24], which has found that graph connectivity is challenging for bounded-depth transformers. The reason is that graph connectivity appears to be an inherently sequential problem, while a transformer's sequential computation is bounded by its depth, as proved in expressivity results [MS24a, MS23, Chi24].