Does more thinking time always improve reasoning accuracy?
Explores whether extending a model's thinking-token budget monotonically improves performance, or whether there is a point beyond which additional reasoning becomes counterproductive.
The prevailing assumption that "more thinking tokens = better reasoning" is empirically false beyond a critical point. Pushing the average thinking token count from ~1,100 to ~15,980 reduced accuracy from 87.3% to 70.3% on the same benchmark.
This non-monotonic relationship — initial improvement followed by steady decline — is consistent across multiple tasks and datasets. The researchers call the degradation phase "overthinking," and it has been largely invisible in prior work because most studies only reported the improving phase of the curve.
The practical implication: there is a sweet spot, and token budgets above it actively harm performance. The current practice of using "more tokens" as a proxy for "more reasoning" is not just wasteful — it is counterproductive past the threshold. And as the related note "Does extended thinking actually improve reasoning or just increase variance?" argues, the gains before the threshold aren't even what they appear to be.
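One way to act on this is to treat the thinking budget as a tunable hyperparameter rather than a fixed maximum. A minimal sketch, assuming a hypothetical `evaluate_at_budget` helper that scores a held-out set under a hard cap on thinking tokens:

```python
# Illustrative sketch (not from the source): sweep the thinking budget on a
# held-out set to locate the sweet spot on a non-monotonic accuracy curve.
# `evaluate_at_budget(budget)` is an assumed callable that runs the benchmark
# with generation capped at `budget` thinking tokens and returns accuracy.

def find_sweet_spot(evaluate_at_budget,
                    budgets=(256, 512, 1024, 2048, 4096, 8192, 16384)):
    """Return (best_budget, accuracy_curve, overthinking_budgets)."""
    curve = {b: evaluate_at_budget(b) for b in budgets}
    best = max(curve, key=curve.get)
    # Budgets above the peak that score worse mark the overthinking region.
    overthinking = [b for b in budgets if b > best and curve[b] < curve[best]]
    return best, curve, overthinking
```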
The bidirectional calibration failure (Between Underthinking and Overthinking): the relationship is not just non-monotonic — models miscalibrate in both directions. For questions within their capability, models often detect difficulty increases and extend reasoning appropriately. But for hard questions beyond their capability, models underthink — failing to recognize the difficulty, or lacking the knowledge to respond effectively, and producing responses shorter than the problem needs. The result: models overthink easy problems (generating unnecessarily long outputs) and underthink hard ones (failing to extend reasoning when it is most needed).
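This failure is easy to check for on your own evaluation set. A rough diagnostic sketch (illustrative thresholds, not the paper's protocol): compare token spend across difficulty buckets and flag a missing length gap.

```python
from statistics import mean

def calibration_gap(items):
    """items: dicts with 'difficulty' in {'easy', 'hard'} and 'length' in tokens.
    A well-calibrated model should spend clearly more tokens on hard items;
    a small or inverted gap is the bidirectional failure described above.
    The 1.5x ratio is an illustrative assumption, not a published threshold."""
    easy = mean(it["length"] for it in items if it["difficulty"] == "easy")
    hard = mean(it["length"] for it in items if it["difficulty"] == "hard")
    return {
        "easy_mean_tokens": easy,
        "hard_mean_tokens": hard,
        "miscalibrated": hard < 1.5 * easy,  # hard items barely longer than easy
    }
```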
Length-based preference optimization provides a surprising intervention: fine-tuning to prefer shorter responses — using only unlabeled data, without ground-truth labels — maintains relatively strong accuracy while reducing token length. The reduction comes disproportionately from incorrect responses (which are significantly longer), but a 10-25% reduction on correct responses is also observed. This suggests models have a latent ability to calibrate difficulty for easy problems but retain an overthinking tendency that preference optimization can reduce.
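Because the intervention needs only samples, the data-construction step is trivial to reproduce. A minimal sketch of label-free pair building, assuming a hypothetical `sample_fn` generator; the paper's exact recipe may differ:

```python
def build_length_preference_pairs(prompts, sample_fn, n=8):
    """Construct label-free preference pairs that prefer shorter responses.

    `sample_fn(prompt, n)` is an assumed helper returning n sampled
    completions for the prompt; no ground-truth answers are consulted.
    Pairs can then be fed to any DPO/SimPO-style preference trainer.
    """
    pairs = []
    for prompt in prompts:
        # Character length as a cheap proxy for token count.
        responses = sorted(sample_fn(prompt, n), key=len)
        pairs.append({
            "prompt": prompt,
            "chosen": responses[0],     # shortest sample preferred
            "rejected": responses[-1],  # longest sample dispreferred
        })
    return pairs
```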
PI framework: the attention-level mechanism behind the threshold. The PI (Test-time Prompt Intervention) framework supplies an attention-level explanation for why the threshold exists. Visualizing attention maps across reasoning steps reveals that verification and backtracking steps (e.g., steps 7-8 in a typical trace) receive minimal subsequent attention — the model generates them but barely reads them. After generating the correct answer step, all following steps predominantly attend to that pivotal moment rather than to intermediate verification. The critical steps — those whose predecessors all receive high attention — can reproduce the reasoning with 75% fewer steps. This turns the behavioral observation (accuracy degrades with more tokens) into a mechanistic explanation: redundant tokens are attention-invisible, contributing neither signal nor structure to the final answer. The overthinking region is precisely where token generation has detached from the attention graph that actually drives outputs. Source: Prompts/Prompting.
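That criterion suggests a simple post-hoc filter over a reasoning trace. A sketch under assumptions (a step-level attention matrix `A`, aggregated from head-level maps; not the PI paper's exact procedure):

```python
import numpy as np

def attention_visible_steps(A, threshold=0.05):
    """A: (n, n) step-level attention matrix where A[i, j] is the attention
    that step i pays to an earlier step j (zero for j >= i). Keeps steps that
    later steps actually read; verification/backtracking steps receiving
    negligible subsequent attention are pruned as attention-invisible."""
    n = A.shape[0]
    # Attention each step receives from all subsequent steps.
    received = np.array([A[j + 1:, j].sum() for j in range(n)])
    keep = received >= threshold * received.max()
    keep[-1] = True  # always keep the final answer step
    return np.flatnonzero(keep)
```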
Optimal reasoning token ratio exists but models cannot reach it. ZebraLogic's analysis of constraint satisfaction problems shows that there is an optimal ratio of reasoning tokens to problem complexity (measured by Z3 solver conflicts). O1-like models scale reasoning tokens with complexity and approach this optimal ratio on moderate problems, but cannot reach it when complexity is extremely high — the model's reasoning-effort ceiling sits below what the problem requires. Self-verification prompting provides only marginal improvement (31.7% baseline → 33.0% after one verification pass, falling back to 32.1% after a second), suggesting the bottleneck is not insufficient verification but insufficient reasoning depth. The optimal-ratio finding quantifies the threshold: the sweet spot is not just "not too many tokens" but a specific relationship between problem difficulty and reasoning budget.
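If such a ratio exists, it implies a budget rule of roughly tokens ≈ r* × complexity. A toy sketch with placeholder constants; `r_star` would have to be fit on a calibration set:

```python
def reasoning_budget(z3_conflicts, r_star=12.0, ceiling=16_000):
    """Scale the thinking budget with problem complexity, here proxied by
    Z3 solver conflicts. Returns (budget, ceiling_hit); ceiling_hit=True is
    the regime where the optimal ratio is unreachable because the model's
    effort ceiling sits below what the problem requires."""
    wanted = int(r_star * z3_conflicts)
    return min(wanted, ceiling), wanted > ceiling
```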
S1-Bench (2025) reveals that LRMs can prejudge question simplicity — especially in Chinese — but thinking length does NOT shorten despite this prejudgment. Models generate unnecessary solution rounds after reaching the correct answer, repeatedly reverifying simple problems already solved. Models with longer thinking processes produce more excessive solution rounds. Furthermore, LRMs sometimes include incorrect intermediate conclusions in their reasoning even when ultimately reaching correct final answers, and sometimes reach the correct answer during reasoning but then deviate to produce incorrect final conclusions. The prejudgment finding is architecturally important: it suggests the overthinking mechanism is not caused by inability to assess difficulty, but by an inability to act on that assessment — the model "knows" the problem is simple but cannot truncate its reasoning accordingly. Source: Arxiv/Evaluations.
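The "unnecessary solution rounds" behavior can be counted mechanically given a segmented trace. A sketch with two assumed helpers (a `rounds` segmentation and an `answers_match` comparator):

```python
def redundant_solution_rounds(rounds, final_answer, answers_match):
    """rounds: reasoning trace split into solution rounds (strings).
    Counts rounds generated after the first round that already contains the
    final answer, i.e. the re-verification behavior S1-Bench observes on
    simple questions. `answers_match(round_text, answer)` is an assumed
    comparator that normalizes answer formatting before comparing."""
    for i, text in enumerate(rounds):
        if answers_match(text, final_answer):
            return len(rounds) - (i + 1)  # everything after the first hit
    return 0  # answer never surfaced mid-trace (e.g., late-deviation cases)
```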
Source: Test Time Compute; enriched from Flaws
Related concepts in this collection
- Does extended thinking actually improve reasoning or just increase variance? When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability. Relevance: the mechanistic explanation for why this threshold exists.
- Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth. Relevance: the alternative strategy that avoids the overthinking trap.
- Why do correct reasoning traces contain fewer tokens? In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals. Relevance: supporting evidence from a different angle.
- Do reasoning models switch between ideas too frequently? Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy. Relevance: the complementary failure mode (insufficient depth per path, not just excessive total tokens).
- Can dialogue planning balance fast responses with strategic depth? Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching. Relevance: DPDP naturally avoids the overthinking threshold by restricting deep search (MCTS) to genuinely uncertain contexts via System 1/2 switching.
- Do personality types shape how AI agents make strategic choices? This research explores whether priming LLM agents with MBTI personality profiles causes them to adopt different strategic behaviors in games. Understanding this matters for designing AI systems optimized for specific tasks. Relevance: personality priming modulates reasoning depth; introversion produces longer, more elaborated rationales, potentially lowering the threshold at which overthinking degrades accuracy, and personality conditioning is an unexamined variable in test-time compute allocation.
- When should retrieval happen during model generation? Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals. Relevance: a retrieval-level analog; just as reasoning tokens past the threshold harm accuracy, retrieval at every step regardless of confidence wastes context and introduces noise. Both findings argue for uncertainty-gated resource allocation rather than fixed budgets.
- Do large language models use one reasoning style or many? Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios. Relevance: cross-domain confirmation; in strategic games, top performers produce the shortest CoT in their strongest game types, while DeepSeek-R1 exhibits "repeated self-doubt" loops in competitive games that inflate tokens without improvement. The overthinking threshold extends to interactive reasoning.
- Why do reasoning models struggle with theory of mind tasks? Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning. Relevance: the overthinking threshold is categorically worse for social reasoning; reasoning effort shows zero or negative correlation with ToM performance, meaning extended thinking actively degrades social cognition rather than merely plateauing, and social tasks may have a near-zero optimal thinking threshold.
- Can we measure how deeply a model actually reasons? What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching? Relevance: DTR explains the threshold mechanistically (tokens past the threshold should have low DTR; early-layer stabilization = pattern-matching filler rather than genuine computation), and Think@n provides a selection mechanism that avoids the overthinking region.
Original note title: reasoning accuracy degrades beyond a critical thinking-token threshold