Does Thinking More Always Help? Understanding Test-Time Scaling in Reasoning Models
This raises a natural question: does thinking more at test time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern: additional thinking initially improves performance, which then declines due to ‘overthinking’. To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance, creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from “more thinking” are not true indicators of improved reasoning, but artifacts of the interplay between model uncertainty and the evaluation metric. This suggests that test-time scaling through extended thinking is not an effective use of the inference budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy than extended thinking.
Incomplete picture of test-time scaling. While appealing, the above narrative in prior works presents an incomplete picture. In contrast to prior claims, our empirical investigation uncovers a nuanced phenomenon: extending thinking at test time initially boosts model accuracy, but performance subsequently degrades with prolonged thinking (cf. Section 2.3). This non-monotonic behavior, a clear pattern consistent across various tasks and datasets (cf. Figure 2), reveals a critical point in the length of the thinking trace beyond which performance declines, a phenomenon we call ‘overthinking’ that is largely unrecognized by existing research. These observations raise a fundamental question: Why does additional thinking beyond a certain point degrade model performance?
Understanding overthinking: a variance-based explanation. To answer the above question, we step back and analyze a simple one-dimensional probabilistic framework (cf. Section 3), examining how changes in the variance of the sampling distribution affect the expected value of a target reward. Interestingly, as the variance increases from low to high, the expected reward exhibits a similar non-monotonic pattern: it first increases, then decreases (cf. Figure 4). Guided by this insight, we empirically assess the variance of reasoning-model outputs under extended thinking by measuring the entropy of their output distributions. Our results clearly demonstrate that extended thinking significantly increases the variance of the response distribution (cf. Figure 5). This explains why average accuracy first improves and then deteriorates, revealing that the apparent gains from extended thinking reflect an illusion rather than genuine improvements in reasoning capability.
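The entropy measurement described above can be sketched in a few lines: sample many final answers at a given thinking budget and compute the Shannon entropy of the empirical answer histogram. A minimal sketch, with toy answer lists standing in for real model samples:

```python
import math
from collections import Counter

def empirical_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy illustration: a peaked answer distribution (short thinking) has low
# entropy, while a diffuse one (extended thinking) has high entropy.
short_thinking = ["42"] * 18 + ["41", "40"]                      # mostly one answer
long_thinking = ["42"] * 6 + ["41", "40", "39", "38"] * 3 + ["37", "36"]
assert empirical_entropy(short_thinking) < empirical_entropy(long_thinking)
```

In practice one would replace the toy lists with the final answers extracted from repeated model generations at each thinking budget.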
Overthinking is inefficient for test-time scaling under a fixed budget. These insights reveal a deeper inefficiency: extending a single reasoning trace is not an optimal use of the test-time compute budget. Because performance does not improve monotonically with more tokens, there is no reliable stopping criterion, making this strategy brittle in practice.
A fix: parallel thinking as a principled alternative. To overcome these limitations, we propose parallel thinking, a test-time scaling strategy inspired by Best-of-N sampling (Beirami et al., 2024; Amini et al., 2024; Nakano et al., 2021; Stiennon et al., 2020; Gui et al., 2024; Jinnai et al., 2024). Instead of continuing one thinking trace, we allocate the same token budget across multiple independent thinking paths and select the final answer via majority voting. This approach avoids entropy overgrowth, mitigates the overthinking trap, and achieves significantly better performance. For example, under a 16K token budget, parallel thinking yields up to 22% higher accuracy compared to sequential scaling (Figure 6). We summarize our contributions as follows.
(i) Empirical diagnosis of overthinking: We investigate test-time scaling by encouraging extended thinking in state-of-the-art reasoning models with prompts such as “Wait” and “Think more”. This reveals a consistent non-monotonic trend in performance across multiple tasks and datasets (cf. Section 2).
(ii) Illusion of test-time scaling: We provide an alternative explanation for the non-monotonic trend of test-time scaling in reasoning models through a simple probabilistic framework. Our analysis clarifies why extending reasoning initially improves performance but eventually leads to degradation, highlighting variance as the key driver of the observed non-monotonic behavior (cf. Section 3).
(iii) Variance-driven explanation of performance degradation in reasoning models: By analyzing the entropy of the response distributions generated by reasoning models, we show that extended thinking increases the variance of the model’s output distribution, consistent with the insights from our probabilistic framework. While this variance increase initially aligns with improved performance, it eventually disrupts reward alignment, explaining the degradation observed beyond a certain point (cf. Section 3.1).
(iv) Effective budget-control via parallel thinking: We propose an alternative test-time scaling strategy, parallel thinking, inspired by Best-of-N sampling. By simultaneously generating multiple independent reasoning paths, this approach circumvents the pitfalls of sequential overthinking and yields higher performance, demonstrating genuine self-improvement capabilities (cf. Section 4). This approach outperforms overthinking across all benchmarks and provides a reliable mechanism for inference-time scaling (Figure 6).
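A minimal sketch of parallel thinking under a fixed budget, assuming a hypothetical `generate(prompt, max_tokens)` callable that returns one final answer per independent reasoning path:

```python
from collections import Counter

def parallel_thinking(generate, prompt, total_budget, n_paths):
    """Split a fixed token budget across n_paths independent reasoning
    traces and return the majority-vote answer (ties broken arbitrarily)."""
    per_path_budget = total_budget // n_paths
    answers = [generate(prompt, max_tokens=per_path_budget) for _ in range(n_paths)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

# Deterministic toy stand-in for a stochastic reasoning model: each call
# returns the next pre-recorded answer.
answers_stream = iter(["17", "17", "13", "17", "17", "9", "17", "17"])
def toy_generate(prompt, max_tokens):
    return next(answers_stream)

assert parallel_thinking(toy_generate, "What is 8 + 9?", 16_384, 8) == "17"
```

The key design choice is that all paths share one budget, so parallel thinking is compared against sequential scaling at equal total token cost rather than at equal per-trace length.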
2.1 Test-Time Budget Control (TTBC) To systematically probe how test-time budgets modulate model behavior, we apply two different budget control approaches on the model’s thinking tokens (Muennighoff et al., 2025) detailed as follows.
• TTBC 1: Wait & Think more. In this approach, we impose no explicit budget constraint on the number of thinking tokens apart from the model’s inherent maximum token limit (32K). Whenever the model attempts to generate the end-of-thinking delimiter, we suppress it, append the token “Wait” to the thinking trace, and feed the modified trace back to the model. This intervention is applied iteratively, encouraging the model to extend its thinking; the only knob we vary is the number of times “Wait” is appended to the thinking trace.
• TTBC 2: Exact thinking tokens. We enforce an exact thinking token budget of t_exact for each reasoning trajectory. Specifically, we iteratively append the token “Wait” to the reasoning trace until the cumulative count of thinking tokens reaches exactly t_exact. Once this threshold is reached, we terminate the thinking stage by allowing the end-of-thinking delimiter to pass, signaling the model to generate the final response. This ensures that every reasoning trajectory is constrained to exactly t_exact thinking tokens. We vary t_exact over {256, 512, 1024, 2048, 4096, 8192, 16384}.
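Both controls hinge on the same intervention: intercept the end-of-thinking delimiter and append “Wait”. The sketch below shows the TTBC 2 variant; the `step` callable, `wait_ids`, and `end_think_id` are hypothetical stand-ins for the actual inference stack, and overshoot from a multi-token “Wait” is ignored for simplicity:

```python
def exact_thinking_budget(step, prompt_ids, t_exact, wait_ids, end_think_id):
    """TTBC 2 sketch: force the thinking stage to run for t_exact tokens.

    `step(ids)` is a hypothetical callable returning the next token id;
    `wait_ids` encodes the "Wait" prompt, and `end_think_id` is the
    end-of-thinking delimiter (left abstract here)."""
    ids = list(prompt_ids)
    n_thinking = 0
    while n_thinking < t_exact:
        tok = step(ids)
        if tok == end_think_id:
            # Suppress the delimiter and nudge the model to keep thinking.
            ids.extend(wait_ids)
            n_thinking += len(wait_ids)
        else:
            ids.append(tok)
            n_thinking += 1
    ids.append(end_think_id)  # budget reached: let the delimiter through
    return ids

# Toy check: a model that immediately tries to stop is forced to "Wait"
# (token id 7) until the budget of 3 thinking tokens is exhausted.
demo = exact_thinking_budget(lambda ids: 99, [1, 2], 3, wait_ids=[7], end_think_id=99)
assert demo == [1, 2, 7, 7, 7, 99]
```

TTBC 1 follows the same loop but replaces the token-count condition with a cap on how many times “Wait” is injected, leaving the overall length bounded only by the model’s maximum context.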
Beyond a critical point, further increasing the thinking budget results in a steady decline in accuracy. Specifically, pushing the average thinking token count from 1100 to 15980 reduces accuracy from 87.3% to 70.3%. This non-monotonic trend challenges the prevailing assumption that “more thinking is always better.” Instead, it reveals a more nuanced insight: test-time reasoning exhibits a critical point beyond which additional thinking turns from helpful to harmful, a phenomenon we call ‘overthinking’. Prior work has overlooked this degradation phase, presenting an incomplete view of the true test-time scaling landscape. To further understand and explain this phenomenon, we extend our analysis in Figure 3 using the Exact Thinking tokens setup (TTBC 2).
These findings consistently reaffirm the same trend, underscoring the importance of reconsidering current test-time reasoning strategies and moving beyond the simplistic belief that more computation inherently leads to better reasoning outcomes.
In Appendix C.1, we extend this analysis to the Minimum Thinking tokens TTBC, and observe a similar non-monotonic relationship between accuracy and the number of thinking tokens.
However, this trend does not continue indefinitely: performance begins to degrade beyond a certain point as the variance increases further. When the variance is too small, the model remains stuck near the proposal mean, resulting in poor reward due to limited exploration. Conversely, when the variance is too large, the model samples indiscriminately across the space, again leading to poor reward. Thus there exists a critical point in the variance of the proposal distribution.
Why does the critical point exist? It arises from two competing forces. Coverage effect: for small $\sigma_\pi^2$, increasing the variance improves the average reward by covering more of the reward peak centered at $\mu_r$. Dilution effect: beyond a point, increasing the variance overspreads the distribution, placing mass on regions far from $\mu_r$ and diminishing the expected reward. The trade-off is evident in equation 4: initially the exponential term dominates, since increasing $\sigma_\pi^2$ shrinks the exponent (its denominator grows), improving the expected reward; eventually the prefactor $1/\sqrt{2\pi(\sigma_r^2 + \sigma_\pi^2)}$ shrinks faster than the exponent gains, reducing the overall value.
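Reading the reward as a normalized Gaussian density centered at $\mu_r$, the expected reward takes the form $\frac{1}{\sqrt{2\pi(\sigma_r^2+\sigma_\pi^2)}}\exp\!\big(-\frac{(\mu_\pi-\mu_r)^2}{2(\sigma_r^2+\sigma_\pi^2)}\big)$, and the coverage/dilution trade-off is easy to verify numerically. A short sketch under that reading (parameter values are illustrative; in this form the analytic optimum sits at $\sigma_\pi^2 = (\mu_\pi-\mu_r)^2 - \sigma_r^2$):

```python
import math

def expected_reward(mu_pi, var_pi, mu_r, var_r):
    """Expected reward of a Gaussian policy N(mu_pi, var_pi) against a
    normalized Gaussian reward centered at mu_r with variance var_r."""
    s = var_r + var_pi
    return math.exp(-(mu_pi - mu_r) ** 2 / (2 * s)) / math.sqrt(2 * math.pi * s)

grid = [0.5 * k for k in range(1, 41)]                 # var_pi from 0.5 to 20
vals = [expected_reward(0.0, v, 3.0, 1.0) for v in grid]
best = grid[vals.index(max(vals))]

# Reward first rises with variance (coverage), then falls (dilution);
# here the analytic optimum is var_pi = (0 - 3)^2 - 1 = 8.
assert vals[0] < max(vals) and vals[-1] < max(vals)    # interior maximum
assert 7.5 <= best <= 8.5
```

The interior maximum is exactly the critical point discussed above: below it the coverage effect dominates, above it the dilution effect takes over.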
Connection with test-time scaling in reasoning models. Our simple illustration reveals a powerful insight: increasing the variance of the policy distribution can initially boost the expected reward, not because the policy has improved, but because of greater overlap with the reward. Crucially, this improvement is an illusion, driven by randomness rather than genuine policy refinement. We hypothesize that a similar effect underlies the test-time scaling gains observed in Figure 2: extending thinking traces with prompts like “Wait” acts as a knob that increases the variance of the model’s output distribution.
In the next subsection, we draw this analogy directly: each additional reasoning/thinking step increases the entropy of the policy, leading to broader sampling and a rise in accuracy, up to a point. Beyond that, the distribution becomes too diffuse, and performance deteriorates.
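The analogy can be made concrete with a toy categorical example: if the correct answer is narrowly ranked second in the model’s logits, raising the sampling temperature (and hence the entropy of the policy) first increases and then decreases the probability of producing it. This is purely an illustrative stand-in for the entropy analysis, not the paper’s experiment:

```python
import math

def softmax_prob(logits, index, temperature):
    """Probability of `index` under a temperature-scaled softmax."""
    exps = [math.exp(z / temperature) for z in logits]
    return exps[index] / sum(exps)

# Correct answer (index 1) sits just below a wrong answer (index 0),
# with eight clearly worse distractors.
logits = [2.0, 1.8] + [0.0] * 8
temps = [0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
p_correct = [softmax_prob(logits, 1, t) for t in temps]

# Accuracy rises as entropy grows (the correct answer becomes reachable),
# peaks, then decays toward the uniform value 1/10 as sampling diffuses.
peak = p_correct.index(max(p_correct))
assert 0 < peak < len(temps) - 1
```

At near-zero temperature the policy deterministically commits to the wrong top-ranked answer; at very high temperature it samples almost uniformly; the accuracy peak lies at an intermediate entropy, mirroring the non-monotonic accuracy curves in the paper.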