Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning
Test-time scaling, often referred to as slow-thinking, has been demonstrated to enhance multi-step reasoning in large language models (LLMs). However, despite its widespread utilization, the mechanisms underlying slow-thinking methods remain poorly understood. This paper explores the mechanisms of external slow-thinking from a theoretical standpoint. We begin by examining the snowball error effect within the LLM reasoning process and connect it to the likelihood of correct reasoning using information theory. Building on this, we show that external slow-thinking methods can be interpreted as strategies to mitigate the error probability. We further provide a comparative analysis of popular external slow-thinking approaches, ranging from simple to complex, highlighting their differences and interrelationships. Our findings suggest that the efficacy of these methods is not primarily determined by the specific framework employed, and that expanding the search scope or the model's internal reasoning capacity may yield more sustained improvements in the long term.
Specifically, empirical studies have shown that the reasoning quality of LLMs improves with extended inference time (Lightman et al., 2023). This observation has sparked a new research trajectory focused on augmenting the reasoning abilities of LLMs by increasing inference costs during the test-time phase, a concept referred to as test-time scaling, or more colloquially, slow-thinking.
Test-time scaling strategies can be generally classified into two primary approaches: internal and external slow-thinking (Jiang et al., 2024; Min et al., 2024). Internal slow-thinking involves adjusting model parameters through additional training on specifically designed reasoning tasks, aiming to inherently extend the model's output length and thereby enhance its reasoning capabilities. In contrast, external slow-thinking focuses on increasing inference costs by introducing additional computational steps, such as re-sampling or re-generating model outputs multiple times (Brown et al., 2024), thereby prolonging inference time and improving reasoning quality.
This paper focuses on external slow-thinking techniques, which are inspired by human cognitive processes. When facing complex questions, humans often take extra time to reflect and refine their intermediate answers, leading to greater accuracy. Similarly, external slow-thinking methods, such as the Best-of-N (BoN) strategy, draw multiple samples and evaluate them using techniques like majority voting or ranking (Cobbe et al., 2021). Beyond simpler methods, advanced frameworks like CoT (Wei et al., 2022), ToT (Yao et al., 2024), and MCTS-based approaches inspired by AlphaGo (Silver et al., 2016) explore solution spaces in tree structures.
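The BoN strategy with majority voting can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `generate_answer` is a hypothetical stand-in for one stochastic LLM sample, here drawing from a fixed toy distribution in which the correct answer is merely the most likely single outcome.

```python
import random
from collections import Counter

def generate_answer(question):
    """Stand-in for one stochastic LLM sample (toy distribution):
    the correct answer "42" is drawn with probability only 0.4."""
    return random.choices(["42", "41", "43"], weights=[0.4, 0.3, 0.3])[0]

def best_of_n(question, n=16):
    """Draw n independent samples and select the final answer by majority voting."""
    samples = [generate_answer(question) for _ in range(n)]
    answer, _count = Counter(samples).most_common(1)[0]
    return answer

random.seed(0)
print(best_of_n("What is 6 * 7?", n=101))
```

Even though a single sample is correct less than half the time, aggregating many samples makes the modal answer dominate, which is the basic mechanism by which extra inference cost buys reasoning accuracy.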
Despite their promise, external slow-thinking methods face several challenges. First, the mechanisms behind their effectiveness remain poorly understood, hindering the design of more advanced and efficient strategies. Second, practical implementations of complex slow-thinking techniques often achieve limited success unless significant computational resources are added. This is due to the difficulty of optimizing design choices and hyperparameters, which frequently results in suboptimal performance.
This further illustrates the mechanism of external slow-thinking methods: by scaling k, they expand the reasoning space and improve the probability of generating a correct response, at the cost of additional reasoning steps, thereby mitigating the impact of snowball errors. However, selecting the most promising reasoning path poses a significant challenge, as the effectiveness of this selection heavily depends on the reliability of the employed value function, which can substantially influence the overall performance of the method.
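The dependence on value-function reliability can be made concrete with a toy simulation (an illustrative assumption, not an experiment from the paper): one candidate among N is correct, and a noisy reward model must pick it. As the noise grows, selection accuracy decays toward uniform random choice, eroding the benefit of drawing more samples.

```python
import random

def select_with_noisy_reward(true_scores, noise_std, rng):
    """Pick the index with the highest noisy reward estimate."""
    noisy = [s + rng.gauss(0.0, noise_std) for s in true_scores]
    return max(range(len(noisy)), key=noisy.__getitem__)

def selection_accuracy(n_candidates=8, noise_std=0.5, trials=2000, seed=0):
    """Fraction of trials in which the noisy reward model still picks the
    single correct candidate (index 0, true score 1.0; others 0.0)."""
    rng = random.Random(seed)
    true_scores = [1.0] + [0.0] * (n_candidates - 1)
    hits = sum(select_with_noisy_reward(true_scores, noise_std, rng) == 0
               for _ in range(trials))
    return hits / trials

for std in (0.1, 0.5, 2.0):
    print(f"noise_std={std}: accuracy={selection_accuracy(noise_std=std):.2f}")
```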
We begin by determining the probability of correct reasoning for BoN and MCTS using the results from Theorem 4.6. Assuming a total of L reasoning steps, BoN can be characterized as generating N candidate reasoning steps in the first layer, extending each candidate with a single step in every subsequent layer, and finally applying a reward model (RM) to select one of the N complete paths at the L-th layer.
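The layered structure described above can be sketched as follows. This is a schematic, assuming hypothetical stand-ins for both the per-step generator (`next_step`, which here appends a random number where a real system would decode text) and the reward model.

```python
import random

def next_step(prefix, rng):
    """Stand-in for one LLM reasoning step; a real system would decode text."""
    return prefix + [rng.random()]

def reward_model(path):
    """Stand-in RM: scores a complete path (here, simply by its mean value)."""
    return sum(path) / len(path)

def best_of_n_paths(n=4, L=3, seed=0):
    rng = random.Random(seed)
    # Layer 1: branch into N candidate first steps.
    paths = [next_step([], rng) for _ in range(n)]
    # Layers 2..L: extend each candidate by a single step per layer.
    for _ in range(L - 1):
        paths = [next_step(p, rng) for p in paths]
    # Layer L: the RM selects one of the N complete paths.
    return max(paths, key=reward_model)

best = best_of_n_paths(n=4, L=3)
print(len(best))  # each candidate path has exactly L steps
```

Note that this structure spends N generations per layer, so the total reasoning cost scales as N * L, which is the quantity the comparative analysis trades off against the probability of correct reasoning.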
In contrast, MCTS employs a more intricate structure, making it difficult to derive a closed-form expression for the probability of correct reasoning. To simplify the analysis, we consider the "best-case" and "worst-case" scenarios for MCTS. Here, "best" and "worst" are defined by how difficult it is for BoN to achieve a comparable probability of correct reasoning, rather than by the actual performance of MCTS.
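The intricacy of MCTS stems from its four interleaved phases, which make the number of steps spent on any one path data-dependent. The following is a generic UCT-style sketch, not the specific variant analyzed here; the rollout reward is a random stand-in for an actual value estimate of a reasoning path.

```python
import math
import random

class Node:
    def __init__(self, depth, parent=None):
        self.depth = depth
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb1(child, c=1.4):
    """Upper-confidence score balancing exploitation and exploration."""
    if child.visits == 0:
        return float("inf")
    return (child.value / child.visits +
            c * math.sqrt(math.log(child.parent.visits) / child.visits))

def mcts(root, max_depth, branching, iterations, rng):
    for _ in range(iterations):
        # 1. Selection: descend by UCB1 until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb1)
        # 2. Expansion: add children unless at maximum depth.
        if node.depth < max_depth:
            node.children = [Node(node.depth + 1, node) for _ in range(branching)]
            node = rng.choice(node.children)
        # 3. Simulation: stand-in rollout reward for the reasoning path.
        reward = rng.random()
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent

rng = random.Random(0)
root = Node(depth=0)
mcts(root, max_depth=3, branching=2, iterations=50, rng=rng)
print(root.visits)  # equals the number of simulations run
```

Because selection revisits promising subtrees unevenly, the per-path cost has no simple closed form, which motivates bounding MCTS between the best and worst cases above.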
In conclusion, external slow-thinking methods introduce additional reasoning steps to mitigate the impact of snowball errors. However, on one hand, an inaccurate reward function can cause these additional reasoning steps to incur extra selection costs, which may decrease the probability of correct reasoning. On the other hand, the effectiveness of mitigating snowball errors is primarily determined by the total reasoning cost, with the specific framework having limited impact on the overall outcome.