
Does more thinking time always improve reasoning accuracy?

Explores whether extending a model's thinking tokens monotonically improves performance, or whether there is a point beyond which additional reasoning becomes counterproductive.

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The prevailing assumption that "more thinking tokens = better reasoning" is empirically false beyond a critical point. Pushing the average thinking token count from ~1,100 to ~15,980 reduced accuracy from 87.3% to 70.3% on the same benchmark.

This non-monotonic relationship — initial improvement followed by steady decline — is consistent across multiple tasks and datasets. The researchers call the degradation phase "overthinking," and it has been largely invisible in prior work because most studies only reported the improving phase of the curve.

The practical implication: there is a sweet spot, and token budgets above it actively harm performance. The current practice of using "more tokens" as a proxy for "more reasoning" is not just wasteful; past the threshold it is counterproductive. And per the related note "Does extended thinking actually improve reasoning or just increase variance?", even the gains before the threshold aren't what they appear to be.
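
To make the implication concrete, here is a minimal Python sketch of locating the sweet spot by sweeping token budgets; the `run_benchmark` harness interface and the toy accuracy curve are illustrative assumptions, not artifacts from the source.

```python
# Minimal sketch: sweep thinking-token budgets and locate the accuracy
# sweet spot empirically. `run_benchmark` is a hypothetical harness that
# evaluates a fixed benchmark under a given thinking-token cap and
# returns accuracy in [0, 1]; substitute your own evaluation loop.

def find_token_sweet_spot(run_benchmark, budgets):
    """Return the budget with the highest measured accuracy.

    Because the accuracy curve is non-monotonic (rise, then decline),
    a plain argmax over a grid of budgets is enough to expose the
    threshold; a binary search would wrongly assume monotonicity.
    """
    results = {b: run_benchmark(max_thinking_tokens=b) for b in budgets}
    return max(results, key=results.get), results

if __name__ == "__main__":
    # Toy stand-in for a real harness: accuracy peaks near ~1,100 tokens
    # and decays toward the reported ~70% at very large budgets.
    def toy_harness(max_thinking_tokens):
        return 0.873 - 0.17 * abs(max_thinking_tokens - 1_100) / 15_000

    budgets = [250, 500, 1_100, 2_000, 4_000, 8_000, 16_000]
    best, curve = find_token_sweet_spot(toy_harness, budgets)
    for b in budgets:
        print(f"{b:>6} tokens -> accuracy {curve[b]:.3f}")
    print("sweet spot:", best)
```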

The bidirectional calibration failure (Between Underthinking and Overthinking): The relationship is not just non-monotonic; models miscalibrate in both directions. Within the range of questions they can handle, models often detect increases in difficulty and extend reasoning appropriately. But for hard questions beyond their capability, models underthink, failing to recognize the difficulty or lacking the knowledge to respond effectively, and produce responses shorter than needed. The result: models overthink easy problems (generating unnecessarily long outputs) and underthink hard ones (failing to extend reasoning when it is most needed).
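
A measurement like the following could surface that miscalibration in a model's own outputs; it is a minimal sketch assuming per-item records with hypothetical `difficulty`, `n_tokens`, and `correct` fields.

```python
# Minimal sketch: quantify the bidirectional calibration failure by
# comparing response length with an assumed per-item difficulty label.
# The record fields (`difficulty`, `n_tokens`, `correct`) are
# hypothetical names for illustration, not from the source.
from statistics import mean

def length_by_difficulty(records):
    """Report mean response length and accuracy per difficulty bucket.

    Overthinking shows up as long responses on easy items; underthinking
    as short responses on hard items that are also answered incorrectly.
    """
    buckets = {}
    for r in records:
        buckets.setdefault(r["difficulty"], []).append(r)
    return {
        d: {"mean_tokens": mean(r["n_tokens"] for r in rs),
            "accuracy": mean(r["correct"] for r in rs)}
        for d, rs in sorted(buckets.items())
    }

if __name__ == "__main__":
    demo = [
        {"difficulty": "easy", "n_tokens": 1_400, "correct": True},
        {"difficulty": "easy", "n_tokens": 1_100, "correct": True},
        {"difficulty": "hard", "n_tokens": 600, "correct": False},
        {"difficulty": "hard", "n_tokens": 750, "correct": False},
    ]
    print(length_by_difficulty(demo))
```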

Length-based preference optimization provides a surprising intervention: fine-tuning to prefer shorter responses, using only unlabeled data and no ground-truth labels, maintains relatively strong accuracy while reducing token length. The reduction comes disproportionately from incorrect responses (which are significantly longer), but a 10-25% reduction on correct responses is also observed. This suggests models have a latent ability to calibrate to difficulty on easy problems but retain an overthinking tendency that preference optimization can reduce.
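
The pair-construction step is simple enough to sketch. Below is a hedged illustration of building length-based preference pairs without ground-truth labels; the `sample_fn` hook is a hypothetical stand-in for your sampling loop, and the pairs would feed a standard DPO-style trainer (not shown, and not necessarily the exact method used in the source).

```python
# Minimal sketch: build length-based preference pairs with no
# ground-truth labels. `sample_fn(prompt)` is a hypothetical hook that
# returns one sampled response string; the resulting pairs would feed
# a standard DPO-style preference trainer (not shown).

def build_length_preference_pairs(prompts, sample_fn, k=4):
    """For each prompt, draw k samples and prefer the shortest.

    Character length stands in for token count here; a tokenizer-based
    count would be more faithful in practice.
    """
    pairs = []
    for prompt in prompts:
        samples = sorted((sample_fn(prompt) for _ in range(k)), key=len)
        pairs.append({
            "prompt": prompt,
            "chosen": samples[0],    # shortest response preferred
            "rejected": samples[-1], # longest response dispreferred
        })
    return pairs
```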

PI framework: the attention-level mechanism behind the threshold. The PI (Test-time Prompt Intervention) framework explains, at the attention level, why the threshold exists. Visualizing attention maps across reasoning steps reveals that verification and backtracking steps (e.g., steps 7-8 in a typical trace) receive minimal subsequent attention: the model generates them but barely reads them. After generating the correct answer step, all following steps predominantly attend to that pivotal moment rather than to intermediate verification. The critical steps, those whose predecessors all receive high attention, can reproduce the reasoning with 75% fewer steps. This transforms the behavioral observation (accuracy degrades with more tokens) into a mechanistic explanation: redundant tokens are attention-invisible, contributing neither signal nor structure to the final answer. The overthinking region is precisely where token generation has detached from the attention graph that actually drives outputs. Source: Prompts Prompting.
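
The diagnostic implied by this analysis can be sketched directly: given a step-level attention matrix, flag the steps that later steps barely read. The matrix construction (averaging token-level attention within step boundaries) and the threshold value are assumptions for illustration; this is not the PI framework's actual implementation.

```python
# Minimal sketch: flag reasoning steps that later steps barely attend
# to. A[i][j] is the total attention step i pays to earlier step j,
# obtained (not shown) by averaging token-level attention maps within
# step boundaries; the 0.05 threshold is an illustrative assumption.

def attention_invisible_steps(A, threshold=0.05):
    """Return indices of steps whose mean incoming attention from all
    subsequent steps falls below `threshold`: candidates for redundant
    verification/backtracking steps that could be pruned."""
    n = len(A)
    invisible = []
    for j in range(n - 1):  # the final step has no successors to read it
        incoming = [A[i][j] for i in range(j + 1, n)]
        if sum(incoming) / len(incoming) < threshold:
            invisible.append(j)
    return invisible

if __name__ == "__main__":
    # Toy 3-step trace: step 2 reads step 0 heavily but ignores step 1.
    A = [
        [0.0, 0.0, 0.0],
        [0.9, 0.0, 0.0],
        [0.8, 0.02, 0.0],
    ]
    print(attention_invisible_steps(A))  # -> [1]
```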

Optimal reasoning token ratio exists but models cannot reach it. ZebraLogic's analysis of constraint satisfaction problems shows that there is an optimal ratio of reasoning tokens to problem complexity (measured by Z3 solver conflicts). O1-like models scale reasoning tokens with complexity and approach this optimal ratio for moderate problems, but cannot reach it when complexity is extremely high: the reasoning effort ceiling sits below what the problem requires. Self-verification prompting provides only marginal improvement (31.7% baseline → 33.0% after one iteration → 32.1% after a second), suggesting the bottleneck is not insufficient verification but insufficient reasoning depth. The optimal ratio finding quantifies the threshold: the sweet spot is not just "not too many tokens" but a specific relationship between problem difficulty and reasoning budget.
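
A minimal sketch of the ratio check follows, taking Z3 solver conflicts as the complexity measure per the analysis above; the `OPTIMAL_RATIO` constant is a placeholder one would fit from data, not a value reported by ZebraLogic.

```python
# Minimal sketch: compare spent reasoning tokens against a
# complexity-proportional target, with Z3 solver conflicts as the
# complexity measure per the analysis above. `OPTIMAL_RATIO` is a
# placeholder constant one would fit from data, not a reported value.

OPTIMAL_RATIO = 50.0  # assumed reasoning tokens per solver conflict

def reasoning_deficit(reasoning_tokens, z3_conflicts, ratio=OPTIMAL_RATIO):
    """Positive deficit: the model stopped short of the complexity-scaled
    budget (the ceiling effect on very hard problems). Negative: it
    overshot (the overthinking regime)."""
    return ratio * z3_conflicts - reasoning_tokens

if __name__ == "__main__":
    # A problem with 400 conflicts would call for ~20k reasoning tokens
    # under this assumed ratio; a model capped at 8k exhibits the ceiling.
    print(reasoning_deficit(reasoning_tokens=8_000, z3_conflicts=400))
```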

S1-Bench (2025) reveals that LRMs can prejudge question simplicity — especially in Chinese — but thinking length does NOT shorten despite this prejudgment. Models generate unnecessary solution rounds after reaching the correct answer, repeatedly reverifying simple problems already solved. Models with longer thinking processes produce more excessive solution rounds. Furthermore, LRMs sometimes include incorrect intermediate conclusions in their reasoning even when ultimately reaching correct final answers, and sometimes reach the correct answer during reasoning but then deviate to produce incorrect final conclusions. The prejudgment finding is architecturally important: it suggests the overthinking mechanism is not caused by inability to assess difficulty, but by an inability to act on that assessment — the model "knows" the problem is simple but cannot truncate its reasoning accordingly. Source: Arxiv/Evaluations.
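
One way to quantify this failure mode is to count solution rounds generated after the answer first appears. The sketch below assumes a crude round segmentation on common restart markers; both the markers and the substring answer matching are illustrative simplifications, not S1-Bench's methodology.

```python
# Minimal sketch: count solution rounds generated after the answer
# first appears in a trace. Splitting rounds on restart markers and
# substring answer matching are crude illustrative simplifications,
# not S1-Bench's methodology.
import re

ROUND_MARKERS = r"\n(?=Wait|Alternatively|Let me re-?check)"

def excess_solution_rounds(trace, answer):
    """Return the number of rounds produced after the first round that
    already states the final answer (0 if the answer never appears)."""
    rounds = re.split(ROUND_MARKERS, trace)
    for i, r in enumerate(rounds):
        if answer in r:
            return len(rounds) - (i + 1)
    return 0

if __name__ == "__main__":
    trace = ("2 + 5 = 7, so the answer is 7.\n"
             "Wait, let me verify: 2 + 5 is indeed 7.\n"
             "Alternatively, counting up from 5 also gives 7.")
    print(excess_solution_rounds(trace, "7"))  # -> 2
```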


Source: Test Time Compute; enriched from Flaws
