LLM Reasoning and Architecture · Language Understanding and Pragmatics · Reinforcement Learning for LLMs

Does more thinking time actually improve LLM reasoning?

The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The "more thinking = better reasoning" assumption drives major product and research decisions — model releases tout extended thinking modes, inference infrastructure is built around longer traces, researchers benchmark scaling behavior assuming monotonic improvement. But the assumption is directly falsifiable with a controlled experiment, and the data falsifies it.

As the thinking budget grows from ~1,100 to ~16,000 tokens, accuracy drops from 87.3% to 70.3%. The relationship is non-monotonic: beyond a threshold, more tokens actively hurt.
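A minimal sketch of the kind of controlled sweep that exposes this curve, under stated assumptions: the ~1,100 and ~16,000 token endpoints come from the note, while the intermediate budgets, the `model.generate` signature, and the eval-set fields (`prompt`, `gold_answer`) are placeholders rather than any paper's actual harness.

```python
# Hypothetical harness: hold the model and eval set fixed, sweep only the
# thinking-token budget, and record accuracy at each budget.

def accuracy_at_budget(model, eval_set, thinking_budget: int) -> float:
    """Fraction of problems answered correctly with the reasoning trace
    capped at `thinking_budget` tokens (placeholder API)."""
    correct = 0
    for problem in eval_set:
        answer = model.generate(
            prompt=problem.prompt,
            max_thinking_tokens=thinking_budget,  # cap applies to the thinking trace only
        )
        correct += int(answer.strip() == problem.gold_answer)
    return correct / len(eval_set)

def sweep_budgets(model, eval_set, budgets=(1_100, 2_000, 4_000, 8_000, 16_000)):
    """Return (budget, accuracy) pairs; a non-monotonic curve is the
    overthinking signature described above."""
    return [(b, accuracy_at_budget(model, eval_set, b)) for b in budgets]
```

If the "more thinking = better reasoning" assumption held, the accuracies returned by `sweep_budgets` would be non-decreasing in the budget; the numbers above say they are not.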

What makes this a myth rather than just an approximation: it's not that the assumption is wrong at the edges. It's that the assumption was never justified by evidence — it was inferred from partial data (the improving phase of the curve, before the critical point) and then treated as a general truth. The full curve was hidden in plain sight.

The myth persists partly because it maps onto how we think about human reasoning: more reflection should produce better answers. But LLM reasoning traces aren't human reflection. They're stochastic sequences where entropy (variance) and quality (correctness) are different dimensions. Conflating them is a category error. "Why do LLMs generate more novel research ideas than experts?" shows the same error running in the opposite direction: the intuition that LLMs fall short on creative originality also gets empirically reversed. LLMs generate more novel research ideas than human experts, but lack the evaluative capacity to select good ones. Same structure: cognition-imported intuition meets data, intuition loses.

Post-worthy angle: the overthinking finding is a case study in how intuitions about human cognition, imported uncritically into AI evaluation, generate systematic errors in how we build and measure these systems.

The NoThinking finding adds a sharper falsification at the model level: even within reasoning models, bypassing the explicit thinking process entirely (NoThinking, which forces the thinking box to be left empty) outperforms standard thinking across 7 diverse reasoning datasets when token count is controlled. The performance advantage of reasoning models may come partly from the token budget itself rather than from the structured thinking process. If NoThinking matches or beats Thinking at equal tokens, the thinking box is not doing uniquely valuable work; it may simply be providing space to generate tokens that helps the model reach answers, rather than implementing a genuine reasoning process.
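A rough sketch of the NoThinking control under stated assumptions: the `<think>` tag strings, the prefill trick, and the `model.generate` signature are illustrative placeholders, not the paper's implementation; the point is that both conditions share one token budget, so the comparison isolates the thinking process itself.

```python
# Hypothetical sketch: standard thinking vs. a NoThinking control that
# pre-fills a closed (empty) thinking block, at a matched token budget.

THINKING_PREFILL = "<think>\n"              # model fills the thinking box itself
NOTHINKING_PREFILL = "<think>\n</think>\n"  # thinking box forced to stay empty

def run_condition(model, problem, prefill: str, max_new_tokens: int) -> str:
    """Same prompt, same budget; only the prefill differs, so any accuracy gap
    is attributable to the explicit thinking process."""
    return model.generate(
        prompt=problem.prompt + prefill,
        max_new_tokens=max_new_tokens,
    )

def compare_conditions(model, eval_set, max_new_tokens: int) -> dict:
    """Accuracy of each condition at the same token budget."""
    scores = {"thinking": 0, "nothinking": 0}
    for problem in eval_set:
        for name, prefill in (("thinking", THINKING_PREFILL),
                              ("nothinking", NOTHINKING_PREFILL)):
            answer = run_condition(model, problem, prefill, max_new_tokens)
            scores[name] += int(answer.strip() == problem.gold_answer)
    return {name: hits / len(eval_set) for name, hits in scores.items()}
```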

AbstentionBench adds a third dimension to this falsification: reasoning fine-tuning doesn't just produce diminishing token-level returns; it actively degrades calibration, reducing abstention rates by 24%. The "more thinking" myth operates at two timescales: at inference time, more tokens hurt past a threshold; at training time, reasoning fine-tuning hurts epistemic calibration. The cost of optimizing for reasoning performance is paid not just in overthinking but in lost capacity to recognize the limits of that reasoning. "Does reasoning fine-tuning make models worse at declining to answer?" documents this training-time dimension.
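A toy sketch of how that calibration cost could be measured, in the spirit of AbstentionBench rather than its actual protocol: the refusal keywords, dataset fields, and model API below are placeholder assumptions, and a real evaluation would use a calibrated judge instead of keyword matching.

```python
# Hypothetical sketch: on questions that should be declined (unanswerable or
# underspecified), measure how often a model abstains, then compare a base
# checkpoint against its reasoning-fine-tuned counterpart.

ABSTAIN_MARKERS = ("i don't know", "cannot be determined", "not enough information")

def is_abstention(answer: str) -> bool:
    """Crude keyword check for a refusal to answer (stand-in for a real judge)."""
    lowered = answer.lower()
    return any(marker in lowered for marker in ABSTAIN_MARKERS)

def abstention_rate(model, should_abstain_set) -> float:
    """Fraction of should-abstain questions the model actually declines."""
    declined = sum(
        is_abstention(model.generate(prompt=q.prompt)) for q in should_abstain_set
    )
    return declined / len(should_abstain_set)

# The drop in abstention reported above would show up as the gap between rates:
#   abstention_rate(base_model, data) - abstention_rate(reasoning_tuned_model, data)
```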


Source: Test Time Compute

Original note title: The "more thinking is always better" assumption is LLMs' most testable, falsifiable myth