Why do correct reasoning traces appear shorter than incorrect ones?
This explores why, in o1-style reasoning models, the chains of thought that land on correct answers tend to use fewer tokens than the ones that get it wrong — and what that says about what longer 'thinking' is actually doing.
The short version is that extra length is often a symptom of trouble, not a sign of deeper thinking. Across QwQ, DeepSeek-R1, and LIMO, correct solutions average fewer tokens than incorrect ones, and the longer a trace runs the more self-revisions it accumulates: revisions that tend to introduce and compound errors rather than repair them (see "Why do correct reasoning traces contain fewer tokens?"). So the causation often runs backwards from what you'd assume: the model isn't failing because it's confused and therefore writing more; it's writing more, second-guessing itself, and talking itself out of a right answer.
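A minimal sketch of how that comparison could be measured on your own traces. Everything here is an assumption for illustration: whitespace splitting stands in for a real tokenizer, the `REVISION_MARKERS` list is a hypothetical proxy for self-revisions, and the two toy traces are invented, not drawn from the cited work.

```python
from statistics import mean

# Hypothetical markers of self-revision; real traces may phrase these differently.
REVISION_MARKERS = ("wait", "alternatively", "hmm", "let me reconsider")


def count_revisions(trace: str) -> int:
    """Count revision-marker occurrences in a trace (case-insensitive substring match)."""
    lowered = trace.lower()
    return sum(lowered.count(marker) for marker in REVISION_MARKERS)


def summarize(traces):
    """Group (text, is_correct) pairs and report mean length and mean revisions per group.

    Length is measured in whitespace-separated words, a rough stand-in for tokens.
    """
    summary = {}
    for label in (True, False):
        group = [text for text, ok in traces if ok == label]
        if not group:
            continue
        summary[label] = {
            "mean_tokens": mean(len(text.split()) for text in group),
            "mean_revisions": mean(count_revisions(text) for text in group),
        }
    return summary


# Invented toy data, for illustration only.
toy_traces = [
    ("The sum is 4 + 5 = 9.", True),
    ("Try 4 + 5 = 9. Wait, maybe it is 10. Hmm, alternatively 8.", False),
]
toy_summary = summarize(toy_traces)
```

On real data you would swap in the model's tokenizer and a more careful revision detector; the point is only that both quantities are cheap to compute per trace and easy to compare across the correct and incorrect groups.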
That picture sharpens once you stop treating trace length as a measure of difficulty. Controlled A* maze experiments show length only tracks problem difficulty when the problem looks like the training data. Out of distribution, the correlation breaks entirely, suggesting length mostly reflects how well the model can recall a familiar schema rather than how much 'real' computation a hard problem demands (see "Does longer reasoning actually mean harder problems?"). Seen that way, a short correct trace is the model confidently walking a well-worn path; a long incorrect one is the model wandering because it never had the path to begin with.
The wandering is literal. Reasoning models exhibit two reinforcing failure modes, wandering (invalid exploration that chases dead ends) and underthinking (abandoning promising paths too early), and both inflate length while degrading accuracy. Notably, decoding-level nudges like penalizing thought-switching improve accuracy without any fine-tuning, which means the right answer was often reachable and got talked away (see "Why do reasoning models abandon promising solution paths?"). There's even an optimal length: accuracy follows an inverted U, peaking at intermediate trace length, and the optimum shifts longer for harder tasks but shorter for more capable models. RL training naturally drifts toward brevity as models improve, so conciseness emerges from the reward signal rather than being trained in directly (see "Why does chain of thought accuracy eventually decline with length?").
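To make the decoding-level nudge concrete, here is a rough sketch of one way a thought-switching penalty could work: subtract a constant from the logits of tokens that typically open a new line of thought, but only while the current thought is still short. The function names, the `penalty` and `min_thought_len` parameters, and the toy logits are all assumptions for illustration, not the exact method from the cited work.

```python
def penalize_switch(logits, switch_token_ids, tokens_in_current_thought,
                    penalty=3.0, min_thought_len=50):
    """Return a copy of `logits` with switch tokens down-weighted early in a thought.

    logits: list of floats indexed by token id.
    switch_token_ids: ids of tokens that tend to start a new thought
        (for example the id of "Alternatively"); hypothetical here.
    tokens_in_current_thought: tokens generated since the last switch.
    """
    adjusted = list(logits)
    if tokens_in_current_thought < min_thought_len:
        for token_id in switch_token_ids:
            adjusted[token_id] -= penalty
    return adjusted


def greedy_pick(logits):
    """Pick the id with the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary of three token ids; id 1 plays the role of a switch token.
toy_logits = [1.0, 2.5, 2.0]
switch_ids = {1}

early = greedy_pick(penalize_switch(toy_logits, switch_ids, tokens_in_current_thought=10))
late = greedy_pick(penalize_switch(toy_logits, switch_ids, tokens_in_current_thought=80))
```

Early in a thought the switch token loses to the continuation token; once the thought has run past `min_thought_len` the penalty lifts and switching is allowed again. The model weights are untouched, which is why this kind of intervention needs no fine-tuning.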
Here's the part you might not expect: the length isn't really 'reasoning' getting longer at all. Traces with deliberately corrupted or logically invalid steps perform nearly as well as clean ones, and the intermediate tokens carry no special execution semantics; they're generated like any other output (see "Do reasoning traces need to be semantically correct?" and "Do reasoning traces actually cause correct answers?"). Format and demonstration position shape performance far more than logical content (see "What makes chain-of-thought reasoning actually work?"). If the prose is partly stylistic scaffolding rather than load-bearing logic, then a bloated trace is closer to a model rambling than to it reasoning harder, and rambling is where mistakes creep in.
The practical upshot is how you detect and exploit this. Quality beats quantity: step-level confidence scoring catches reasoning breakdowns that whole-trace averaging hides, and lets you stop early before a trace spirals into the self-revisions that sink it (see "Does step-level confidence outperform global averaging for trace filtering?"). Not all length is equal either: a sparse set of planning and backtracking 'anchor' sentences does most of the real steering, so the useful content can be small even inside a long trace (see "Which sentences actually steer a reasoning trace?"). And if you want to evaluate honestly, score the final solution, not the trace: trace-based grading inflates results by rewarding stylistic mimicry, while solution-verification exposes the true ceiling (see "Should reasoning benchmarks score final answers or reasoning traces?").
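The difference between step-level and whole-trace confidence can be sketched in a few lines. This is an illustrative sketch under stated assumptions: each step is taken to already have a confidence score in [0, 1] (for instance derived from token probabilities), and the `window` and `threshold` values and the toy scores are hypothetical.

```python
from statistics import mean


def global_confidence(step_conf):
    """Whole-trace confidence: the plain average over all steps."""
    return mean(step_conf)


def first_breakdown(step_conf, window=2, threshold=0.5):
    """Return the index of the step at which the first low-confidence window ends.

    Slides a window of `window` consecutive steps over the trace and returns
    the index of the window's last step as soon as its mean falls below
    `threshold`. Returns None if no window falls below the threshold.
    """
    for end in range(window, len(step_conf) + 1):
        if mean(step_conf[end - window:end]) < threshold:
            return end - 1
    return None


# Invented per-step confidences: a confident trace with a brief collapse in the middle.
toy_conf = [0.9] * 8 + [0.2, 0.2] + [0.9] * 8

trace_average = global_confidence(toy_conf)   # stays well above the threshold
stop_at = first_breakdown(toy_conf)           # flags the collapse at step index 9
```

The whole-trace average stays above the threshold because sixteen confident steps swamp the two bad ones, while the sliding window flags the collapse as soon as it happens. Because `first_breakdown` only looks at steps generated so far, the same check can run during generation to stop a trace early.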
Sources (10 notes)
Across QwQ, DeepSeek-R1, and LIMO, correct solutions average fewer tokens than incorrect ones. Longer traces correlate with more self-revisions, which introduce and compound errors rather than improve reasoning quality.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, indicating that traces are not causally necessary: they correlate with answers via learned formatting, not functional reasoning.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.
Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.
LR²Bench scores only final answers against deterministic ground truth, not reasoning steps. This methodological choice reveals a 20% ceiling that trace-based evaluation would inflate by counting stylistic reasoning mimicry as actual reasoning capability.