LLM Reasoning and Architecture · Language Understanding and Pragmatics · Reinforcement Learning for LLMs

Does more thinking time actually improve LLM reasoning?

The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The "more thinking = better reasoning" assumption drives major product and research decisions — model releases tout extended thinking modes, inference infrastructure is built around longer traces, researchers benchmark scaling behavior assuming monotonic improvement. But the assumption is directly falsifiable with a controlled experiment, and the data falsifies it.

As the thinking budget grows from ~1,100 to ~16,000 tokens, accuracy drops from 87.3% to 70.3%. The relationship is non-monotonic: beyond a threshold, more tokens actively hurt.
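A minimal sketch of the kind of controlled sweep that exposes this curve, under stated assumptions: the ~1,100 and ~16,000 token endpoints come from the note, while the intermediate budgets, the `model.generate` signature, and the eval-set fields (`prompt`, `gold_answer`) are placeholders rather than any paper's actual harness.

```python
# Hypothetical harness: hold the model and eval set fixed, sweep only the
# thinking-token budget, and record accuracy at each budget.

def accuracy_at_budget(model, eval_set, thinking_budget: int) -> float:
    """Fraction of problems answered correctly with the reasoning trace
    capped at `thinking_budget` tokens (placeholder API)."""
    correct = 0
    for problem in eval_set:
        answer = model.generate(
            prompt=problem.prompt,
            max_thinking_tokens=thinking_budget,  # cap applies to the thinking trace only
        )
        correct += int(answer.strip() == problem.gold_answer)
    return correct / len(eval_set)

def sweep_budgets(model, eval_set, budgets=(1_100, 2_000, 4_000, 8_000, 16_000)):
    """Return (budget, accuracy) pairs; a non-monotonic curve is the
    overthinking signature described above."""
    return [(b, accuracy_at_budget(model, eval_set, b)) for b in budgets]
```

If the "more thinking = better reasoning" assumption held, the accuracies returned by `sweep_budgets` would be non-decreasing in the budget; the numbers above say they are not.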

What makes this a myth rather than just an approximation: it's not that the assumption is wrong at the edges. It's that the assumption was never justified by evidence — it was inferred from partial data (the improving phase of the curve, before the critical point) and then treated as a general truth. The full curve was hidden in plain sight.

The myth persists partly because it maps onto how we think about human reasoning: more reflection should produce better answers. But LLM reasoning traces aren't human reflection. They're stochastic sequences where entropy (variance) and quality (correctness) are different dimensions. Conflating them is a category error. "Why do LLMs generate more novel research ideas than experts?" shows the same error running in the opposite direction: the intuition that LLMs fall short on creative originality also gets empirically reversed. LLMs generate more novel research ideas than human experts, but lack the evaluative capacity to select good ones. Same structure: cognition-imported intuition meets data, intuition loses.

Post-worthy angle: the overthinking finding is a case study in how intuitions about human cognition, imported uncritically into AI evaluation, generate systematic errors in how we build and measure these systems.

The NoThinking finding adds a sharper falsification at the model level: even within reasoning models, bypassing the explicit thinking process entirely (NoThinking, which forces the thinking box to be left empty) outperforms standard thinking across 7 diverse reasoning datasets when token count is controlled. The performance advantage of reasoning models may come partly from the token budget itself rather than from the structured thinking process. If NoThinking matches or beats Thinking at equal tokens, the thinking box is not doing uniquely valuable work; it may simply be providing space to generate tokens that helps the model reach answers, rather than implementing a genuine reasoning process.
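A rough sketch of the NoThinking control under stated assumptions: the `<think>` tag strings, the prefill trick, and the `model.generate` signature are illustrative placeholders, not the paper's implementation; the point is that both conditions share one token budget, so the comparison isolates the thinking process itself.

```python
# Hypothetical sketch: standard thinking vs. a NoThinking control that
# pre-fills a closed (empty) thinking block, at a matched token budget.

THINKING_PREFILL = "<think>\n"              # model fills the thinking box itself
NOTHINKING_PREFILL = "<think>\n</think>\n"  # thinking box forced to stay empty

def run_condition(model, problem, prefill: str, max_new_tokens: int) -> str:
    """Same prompt, same budget; only the prefill differs, so any accuracy gap
    is attributable to the explicit thinking process."""
    return model.generate(
        prompt=problem.prompt + prefill,
        max_new_tokens=max_new_tokens,
    )

def compare_conditions(model, eval_set, max_new_tokens: int) -> dict:
    """Accuracy of each condition at the same token budget."""
    scores = {"thinking": 0, "nothinking": 0}
    for problem in eval_set:
        for name, prefill in (("thinking", THINKING_PREFILL),
                              ("nothinking", NOTHINKING_PREFILL)):
            answer = run_condition(model, problem, prefill, max_new_tokens)
            scores[name] += int(answer.strip() == problem.gold_answer)
    return {name: hits / len(eval_set) for name, hits in scores.items()}
```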

AbstentionBench adds a third dimension to this falsification: reasoning fine-tuning doesn't just produce diminishing token-level returns; it actively degrades calibration, reducing abstention rates by 24%. The "more thinking" myth operates at two timescales: at inference time, more tokens hurt past a threshold; at training time, reasoning fine-tuning hurts epistemic calibration. The cost of optimizing for reasoning performance is paid not just in overthinking but in lost capacity to recognize the limits of that reasoning. "Does reasoning fine-tuning make models worse at declining to answer?" documents this training-time dimension.
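A toy sketch of how that calibration cost could be measured, in the spirit of AbstentionBench rather than its actual protocol: the refusal keywords, dataset fields, and model API below are placeholder assumptions, and a real evaluation would use a calibrated judge instead of keyword matching.

```python
# Hypothetical sketch: on questions that should be declined (unanswerable or
# underspecified), measure how often a model abstains, then compare a base
# checkpoint against its reasoning-fine-tuned counterpart.

ABSTAIN_MARKERS = ("i don't know", "cannot be determined", "not enough information")

def is_abstention(answer: str) -> bool:
    """Crude keyword check for a refusal to answer (stand-in for a real judge)."""
    lowered = answer.lower()
    return any(marker in lowered for marker in ABSTAIN_MARKERS)

def abstention_rate(model, should_abstain_set) -> float:
    """Fraction of should-abstain questions the model actually declines."""
    declined = sum(
        is_abstention(model.generate(prompt=q.prompt)) for q in should_abstain_set
    )
    return declined / len(should_abstain_set)

# The drop in abstention reported above would show up as the gap between rates:
#   abstention_rate(base_model, data) - abstention_rate(reasoning_tuned_model, data)
```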


Source: Test Time Compute

Original note title: The "more thinking is always better" assumption is LLMs' most testable, falsifiable myth