Does more thinking always help large language models, or does it sometimes hurt?
This explores whether longer reasoning chains and extended 'thinking' reliably improve LLM performance, or whether more deliberation sometimes backfires.
The corpus comes down firmly on "it depends, and often it hurts." The clearest evidence is that simply giving a model more to chew on degrades it: reasoning accuracy drops from 92% to 68% with just 3,000 tokens of padding, far below the context limit, and chain-of-thought prompting doesn't rescue it (see "Does reasoning ability actually degrade with longer inputs?"). So more material in the context window can actively dilute the signal rather than enrich it.
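To make the padding setup concrete, here is a minimal sketch of how a reasoning prompt could be stretched to a target length with irrelevant text. It is not the FLenQA construction itself: the helper names (`count_tokens`, `pad_prompt`, `relative_drop`), the whitespace tokenizer, and the even spacing of facts are all assumptions made for illustration.

```python
from itertools import cycle


def count_tokens(text):
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def pad_prompt(facts, question, filler_sentences, target_tokens):
    """Scatter the relevant facts through irrelevant filler until the prompt
    holds at least target_tokens whitespace tokens (when filler is needed)."""
    if not any(count_tokens(s) for s in filler_sentences):
        raise ValueError("filler_sentences needs at least one non-empty sentence")

    budget = target_tokens - count_tokens(" ".join(facts)) - count_tokens(question)
    padding, used = [], 0
    filler = cycle(filler_sentences)
    while used < budget:
        sentence = next(filler)
        padding.append(sentence)
        used += count_tokens(sentence)

    # Split the filler into len(facts) + 1 chunks and place one fact between chunks.
    n_chunks = len(facts) + 1
    size = -(-len(padding) // n_chunks)  # ceiling division
    parts = []
    for i in range(n_chunks):
        parts.extend(padding[i * size:(i + 1) * size])
        if i < len(facts):
            parts.append(facts[i])
    parts.append(question)
    return " ".join(parts)


def relative_drop(before, after):
    # Fraction of the original accuracy that was lost.
    return (before - after) / before
```

The task itself is identical at every length; only the amount of surrounding noise changes. On the reported figures, the fall from 92% to 68% is 24 points absolute, or roughly a quarter of the original accuracy in relative terms.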
A big reason is that not all thinking is doing work. When models prune their own reasoning chains, only a handful of token categories matter: symbolic computation tokens are preserved while grammar and meta-discourse get cut first (see "Which tokens in reasoning chains actually matter most?"). Reinforcement learning tells the same story from another angle: only about 20% of tokens are high-entropy 'forking points' that actually drive improvement, and training on just those matches or exceeds full updates (see "Do high-entropy tokens drive reasoning model improvements?"). Most of the verbiage in a long chain is filler around a few decisive moments, which means length and usefulness are only loosely related. Some models trained with hidden chain-of-thought tokens even compute the right answer in their first few layers, then suppress it to produce format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"), so the visible 'thinking' isn't always where the answer lives.
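The "forking points" idea can be sketched in a few lines: compute the entropy of the next-token distribution at each position, keep the top fifth, and average the training loss over only those positions. This is a toy illustration under stated assumptions, not the method from the cited work. The function names are hypothetical, the distributions are plain Python lists, and a real setup would take entropies from the model's full-vocabulary distributions and apply the mask inside the RL objective.

```python
import math


def token_entropy(probs):
    # Shannon entropy (in nats) of one next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_mask(distributions, fraction=0.2):
    """Mark the highest-entropy positions; fraction=0.2 mirrors the ~20% figure."""
    if not distributions:
        return []
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]


def masked_mean_loss(per_token_losses, mask):
    # Average the loss over selected positions only; the rest get no gradient.
    kept = [loss for loss, selected in zip(per_token_losses, mask) if selected]
    if not kept:
        raise ValueError("mask selects no tokens")
    return sum(kept) / len(kept)
```

Positions where the model is nearly certain of the next token have entropy close to zero and are left out of the update, which is one way to see why chain length and learning signal are only loosely coupled.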
There's also a deeper limit on what more thinking can buy you. Reasoning failures cluster not at hard problems but at unfamiliar ones: models lean on memorized instance patterns rather than general algorithms, so extra steps on a genuinely novel instance don't manufacture the missing capability (see "Do language models fail at reasoning due to complexity or novelty?"). And self-improvement is formally capped by a generation-verification gap: a model can't think its way past what it can independently verify without something external (see "What stops large language models from improving themselves?"). More deliberation can't close a gap that's structural rather than effortful. 'Potemkin' understanding makes the same point: models can produce correct-sounding explanations they then fail to apply, so more explanation isn't more competence (see "Can LLMs understand concepts they cannot apply?").
The interesting turn is that the field is starting to treat 'how much to think' as a decision the model should make. Rather than always reasoning at length, one approach trains a single model to route between extended thinking and quick direct answers, using decoupled RL so it self-calibrates when deliberation is worth it (see "Can models learn when to think versus respond quickly?"). That reframes the whole question: the goal isn't maximal thinking, it's *calibrated* thinking, and knowing when to stop is itself a learned skill.
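One plausible reading of "decoupled" can be shown with a toy loss. If the mode choice is a single control token and the loss is averaged over every token in the sequence, a long thinking response dilutes the signal that the mode choice receives. Giving the mode term its own weight avoids that. The sketch below is an assumption-laden simplification rather than the Thinkless implementation: the function names and the `alpha` weight are hypothetical, and it omits clipping, importance ratios, and group-normalized advantages.

```python
def coupled_loss(control_logprob, response_logprobs, advantage):
    # Single average over all tokens: the one control token shares its
    # weight with every response token, so its share shrinks with length.
    tokens = [control_logprob] + list(response_logprobs)
    return -advantage * sum(tokens) / len(tokens)


def decoupled_loss(control_logprob, response_logprobs, advantage, alpha=1.0):
    # Mode selection and answer quality are normalized separately, so the
    # control token's weight (alpha) does not depend on response length.
    if not response_logprobs:
        raise ValueError("response_logprobs must not be empty")
    mode_term = alpha * control_logprob
    answer_term = sum(response_logprobs) / len(response_logprobs)
    return -advantage * (mode_term + answer_term)
```

With a 99-token response, a unit change in the control token's log-probability moves the coupled loss by only 1/100 of the advantage, while the decoupled loss moves by `alpha` times the advantage regardless of length.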
Sources (8 notes)
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds when the model was trained on similar instances.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.