Why do simple math problems get worse with longer reasoning chains?
This explores why adding more reasoning steps can hurt accuracy on easy problems — and what's actually breaking when a model 'overthinks.'
The corpus is surprisingly unified on the headline: more thinking is not free. Accuracy follows an inverted U: it climbs with chain length up to a point, then falls, and the optimal length is shorter for easy problems and for more capable models (Why does chain of thought accuracy eventually decline with length?). One study put hard numbers on the decline: stretching a model from ~1,100 to ~16,000 thinking tokens dropped benchmark accuracy from 87.3% to 70.3%, with models overthinking the easy cases and underthinking the hard ones (Does more thinking time always improve reasoning accuracy?). So for a simple math problem the peak of the curve comes early, and a long chain is likely already on the downslope.
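To make the inverted-U shape concrete, here is a toy sketch in Python. It is not the model used in any of the cited notes: the functional form, the `needed` and `p_derail` parameters, and their values are all assumptions chosen only for illustration. It multiplies a "coverage" term (the chain is long enough to contain the required steps) by a "survival" term (no step has derailed the chain yet), which yields a curve that rises, peaks, and falls, with an earlier peak for easier problems.

```python
import math

def toy_accuracy(n_steps: int, needed: float, p_derail: float = 0.02) -> float:
    """Toy accuracy for a chain of n_steps (illustrative assumption, not a fitted model).

    needed   -- rough number of steps the problem requires (smaller = easier)
    p_derail -- assumed independent chance that any single step derails the chain
    """
    coverage = 1.0 - math.exp(-n_steps / needed)  # benefit of thinking long enough
    survival = (1.0 - p_derail) ** n_steps        # cost: every step can go wrong
    return coverage * survival

def best_length(needed: float, max_steps: int = 200) -> int:
    """Chain length (in steps) that maximizes the toy accuracy."""
    return max(range(1, max_steps + 1), key=lambda n: toy_accuracy(n, needed))

easy_peak = best_length(needed=2.0)   # an easy problem needs few steps
hard_peak = best_length(needed=20.0)  # a hard problem needs many

print(f"easy problem peaks at {easy_peak} steps, hard problem at {hard_peak} steps")
```

Under these assumptions the easy problem's optimum sits well before the hard problem's, so a chain length tuned for hard problems lands on the easy problem's downslope.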
Why does the extra length actively cause errors rather than just waste tokens? Two failure modes show up repeatedly. First, models wander: they explore invalid paths and abandon promising ones prematurely, behaving like 'tourists, not scientists', so each extra step is a fresh chance to drift off the correct answer rather than converge on it (Why do reasoning models abandon promising solution paths?; Why do reasoning LLMs fail at deeper problem solving?). Second, the longer the trace, the more intermediate steps there are to get wrong. Reliability work finds that most failures are process violations in the middle of a trace, not a botched final computation, and that checking intermediate states lifted task success from 32% to 87% (Where do reasoning agents actually fail during long traces?). A simple problem has a short correct path, so a long chain is mostly added surface area for mistakes.
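The difference between scoring a final answer and checking intermediate states can be sketched in a few lines of Python. This is a minimal illustration, not the verification method from the cited work: the `Step` tuple format and the helper names `first_bad_step` and `final_only_check` are hypothetical, and the trace is a hand-written arithmetic example.

```python
import operator
from typing import List, Optional, Tuple

# Hypothetical trace format: (left operand, operator, right operand, claimed result)
Step = Tuple[int, str, int, int]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def first_bad_step(trace: List[Step]) -> Optional[int]:
    """Return the index of the first step whose claimed result is wrong, or None."""
    for i, (left, op, right, claimed) in enumerate(trace):
        if OPS[op](left, right) != claimed:
            return i
    return None

def final_only_check(trace: List[Step], expected_final: int) -> bool:
    """Score only the last claimed value, ignoring how the trace got there."""
    return bool(trace) and trace[-1][3] == expected_final

# Task: compute (3 + 4) * 5 = 35.
short_trace = [(3, "+", 4, 7), (7, "*", 5, 35)]
# A longer trace with an invalid middle step that happens to land on 35 anyway.
long_trace = [(3, "+", 4, 7), (7, "*", 5, 36), (36, "-", 1, 35)]

print(first_bad_step(short_trace), final_only_check(short_trace, 35))
print(first_bad_step(long_trace), final_only_check(long_trace, 35))
```

In the longer trace the final-answer check passes even though the second step is invalid, while the step-level check pinpoints where the process broke. That is the kind of mid-trace failure a final-output score cannot see.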
The most counterintuitive thread reframes what chain length even measures. It's tempting to assume longer reasoning means a harder problem, but controlled A* maze experiments show that trace length tracks difficulty only in-distribution; out of distribution it reflects how close a problem sits to the training data, not how hard it actually is (Does longer reasoning actually mean harder problems?). A related finding pushes further: reasoning breaks at instance-level *unfamiliarity*, not task complexity, because models fit memorized instance patterns rather than general algorithms (Do language models fail at reasoning due to complexity or novelty?). So a 'simple' math problem that happens to be phrased unfamiliarly can trigger a long, schema-hunting chain that the model can't actually ground. In that case length is a symptom of the model casting about, not of genuine effort paying off.
There's also a training-side culprit: models are optimized to produce reasoning steps but never taught when to stop. Given an ill-posed or missing-premise question, reasoning models churn out redundant length while plainer non-reasoning models simply flag it as unanswerable; the reasoning habit has no off-switch (Why do reasoning models overthink ill-posed questions?). Even prompt structure matters: for simple questions, letting the question flow straight to an answer beats forcing step-by-step reasoning, because CoT helps only when the question's information has aggregated into the prompt before reasoning begins (Why do some questions perform better without step-by-step reasoning?). Encouragingly, the same inverted-U paper notes that RL training naturally pulls chains *shorter* as models improve. Brevity emerges from reward, which suggests the fix isn't more thinking but knowing when a problem is already solved (Why does chain of thought accuracy eventually decline with length?).
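One way to picture "knowing when a problem is already solved" is a simple stopping rule: halt once the candidate answer has stopped changing. The Python sketch below is an illustrative heuristic only, not a technique taken from the cited notes. The function name `stop_when_stable`, the `patience` parameter, and the idea of reading one candidate answer per reasoning step are all assumptions.

```python
from typing import Iterable, Optional, Tuple

def stop_when_stable(
    answers: Iterable[Optional[str]], patience: int = 3
) -> Tuple[Optional[str], int]:
    """Return (answer, steps_used), halting once the same candidate answer
    has appeared `patience` steps in a row. None means "no candidate yet"."""
    last: Optional[str] = None
    streak = 0
    steps = 0
    for steps, candidate in enumerate(answers, start=1):
        if candidate is not None and candidate == last:
            streak += 1
        else:
            # A new candidate starts a fresh streak; a missing one resets it.
            streak = 1 if candidate is not None else 0
        last = candidate
        if streak >= patience:
            return candidate, steps
    return last, steps

# Hypothetical per-step candidates: the chain settles on "12", then drifts.
candidates = [None, "12", "12", "12", "15", "15"]
answer, used = stop_when_stable(candidates, patience=3)
print(answer, used)
```

With these hypothetical candidates the rule stops at the fourth step with "12" and never reaches the later drift to "15". Without the rule, the chain would have ended on the drifted answer.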
Sources (9 notes)
Why does chain of thought accuracy eventually decline with length? Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Does more thinking time always improve reasoning accuracy? Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Why do reasoning models abandon promising solution paths? Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Why do reasoning LLMs fail at deeper problem solving? Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
Where do reasoning agents actually fail during long traces? Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
Does longer reasoning actually mean harder problems? Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Do language models fail at reasoning due to complexity or novelty? LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
Why do reasoning models overthink ill-posed questions? Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Why do some questions perform better without step-by-step reasoning? Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.