Why do longer reasoning chains correlate with lower accuracy in o1-like models?

This explores why o1-style reasoning models often get *less* accurate as their chains of thought grow longer — and whether length itself is the cause or just a symptom of something deeper.

The corpus points to a counterintuitive answer: longer is rarely better, and the length is usually a *signal* of trouble rather than the trouble itself. Accuracy as a function of chain length traces an inverted U: it climbs to an intermediate sweet spot, then falls. That optimal length shrinks as models get more capable, so the strongest models actually prefer shorter chains (see: Why does chain of thought accuracy eventually decline with length?). Strikingly, you can match verbose reasoning at a fraction of the cost: minimal "draft" chains hit equivalent accuracy on arithmetic, symbolic, and commonsense tasks using just 7.6% of the tokens, because most of the removed words were doing stylistic and documentation work, not computation (see: Can minimal reasoning chains match full explanations?).

So what fills the extra tokens when chains run long and wrong? A lot of it is wasted motion. Reasoning models tend to *underthink*: they abandon promising solution paths mid-exploration and switch to new ones prematurely, burning tokens on half-finished approaches. A simple decoding-time penalty on thought-transition tokens curbs the switching and lifts accuracy with no retraining at all (see: Do reasoning models switch between ideas too frequently?). The same picture shows up as two reinforcing failures, "wandering" (invalid exploration) and underthinking, which reflect structural disorganization rather than a shortage of compute; the right answer was often reachable but got dropped (see: Why do reasoning models abandon promising solution paths?). Length, in other words, frequently measures thrashing.
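The decoding-time penalty described above can be sketched as a simple logit adjustment. This is a minimal illustration, not the published implementation: the function names, the parameters `alpha` (penalty strength) and `beta` (how many tokens into a thought the penalty stays active), their default values, and the toy token ids are all assumptions made for the example.

```python
from typing import Dict, Set


def apply_switch_penalty(
    logits: Dict[int, float],
    switch_token_ids: Set[int],
    tokens_since_thought_start: int,
    alpha: float = 3.0,   # hypothetical penalty strength
    beta: int = 600,      # hypothetical penalty duration, in tokens
) -> Dict[int, float]:
    """Return a copy of `logits` with thought-switch tokens penalized.

    While the current thought is younger than `beta` tokens, every token
    that would open a new thought (for example "Alternatively") has
    `alpha` subtracted from its logit. The input dict is not modified.
    """
    if tokens_since_thought_start >= beta:
        return dict(logits)  # thought is old enough: leave logits untouched
    return {
        tok: (score - alpha if tok in switch_token_ids else score)
        for tok, score in logits.items()
    }


def greedy_pick(logits: Dict[int, float]) -> int:
    """Pick the highest-scoring token id (greedy decoding)."""
    return max(logits, key=logits.get)


# Toy vocabulary: id 7 stands in for a switch token such as "Alternatively".
raw = {3: 1.2, 7: 2.0, 9: 1.5}
early = apply_switch_penalty(raw, {7}, tokens_since_thought_start=10)
late = apply_switch_penalty(raw, {7}, tokens_since_thought_start=900)
```

Early in a thought the switch token loses its lead and decoding continues on the current path; once the thought has run past `beta` tokens the model is free to switch again. Because the adjustment happens only at sampling time, no weights change, which matches the "no retraining" property the note describes.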

There's also a more mechanical reason long chains decay: every additional step is another place for an error to enter and propagate. Under manipulative multi-turn prompts, reasoning models drop 25–29% in accuracy precisely because extended chains create more corruption points where a single wrong step snowballs into a confident wrong conclusion (see: Are reasoning models actually more vulnerable to manipulation?). More reasoning does dampen sensitivity to noisy inputs, but a structural robustness floor remains: extra steps reduce the effect of a perturbation but can never zero it out (see: Can longer reasoning chains eliminate model sensitivity to input noise?). And token-level analysis finds that local memorization (predicting from the just-preceding tokens rather than genuinely reasoning) accounts for up to 67% of errors, an effect that gets worse as complexity and distributional shift grow (see: Where do memorization errors arise in chain-of-thought reasoning?).
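To make the error-accumulation argument concrete, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability and that any single failed step ruins the final answer. That is a deliberate simplification for illustration, not a model taken from the cited notes, and the function name `chain_success_probability` is hypothetical.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps of a chain are correct.

    Simplifying assumptions (not from the source notes): steps are
    independent, equally reliable, and one wrong step spoils the answer.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# Compare short, medium, and long chains at the same per-step reliability.
for steps in (5, 20, 80):
    print(steps, round(chain_success_probability(0.99, steps), 3))
```

Under these assumptions even a highly reliable step compounds into a noticeably lower end-to-end success rate as the chain grows, which is the intuition behind "more corruption points". Real chains can self-correct, so this should be read as a pessimistic bound on intuition rather than a prediction.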

Here's the part you might not expect: the length itself often isn't tracking problem difficulty at all. In controlled maze experiments, trace length correlates with difficulty only *inside* the training distribution and decouples completely outside it; long traces mostly reflect recalled training schemas, not adaptive computation on a hard problem (see: Does longer reasoning actually mean harder problems?). That reframes the whole correlation: models don't fail at some complexity threshold, they fail at instance *novelty* boundaries, fitting memorized instance patterns rather than general algorithms (see: Do language models fail at reasoning due to complexity or novelty?). When you push them off-distribution, chain-of-thought degrades predictably, producing fluent but logically inconsistent reasoning: the *form* of thinking without the validity (see: Does chain-of-thought reasoning actually generalize beyond training data?).

Two more notes widen the picture. Even raw input length hurts long before the context window is anywhere near full: accuracy falls from 92% to 68% with just 3,000 tokens of padding, and chain-of-thought prompting doesn't rescue it (see: Does reasoning ability actually degrade with longer inputs?). And when problems genuinely demand sustained long-chain reflection and backtracking, frontier models like o1-preview and DeepSeek-R1 hit a ceiling of roughly 20–23.6% exact match on constraint-satisfaction tasks; fluent reflection doesn't convert into real problem-solving on unfamiliar structures (see: Can reasoning models actually sustain long-chain reflection?). The takeaway: long chains correlate with low accuracy because length is usually a symptom of off-distribution recall, premature path-switching, and accumulating error, not a dial you can turn up to think harder.

Sources (12 notes)

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Are reasoning models actually more vulnerable to manipulation?

GaslightingBench-R shows that multi-turn manipulative prompts reduce the accuracy of reasoning models significantly more than that of standard models. Extended chains create more corruption points, allowing a single wrong step to propagate into a confident incorrect conclusion.

Can longer reasoning chains eliminate model sensitivity to input noise?

Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization (prediction based on the immediately preceding tokens) accounts for up to 67% of reasoning errors, especially as complexity and distributional shift increase.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed when the model was trained on similar instances.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
