Why do correct reasoning traces contain fewer tokens?
In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
A counterintuitive empirical finding: when comparing correct vs. incorrect solutions to the same questions in o1-like models (QwQ, DeepSeek-R1, LIMO), the correct solutions are systematically shorter. More tokens correlate with wrongness, not rightness.
This directly challenges the "longer = better" narrative underlying much of the test-time scaling literature. If scaling compute leads to longer traces, and longer traces are more likely to be incorrect, then compute scaling via trace extension is actively selecting for worse outputs.
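A minimal sketch of how the comparison can be run, assuming a hypothetical set of sampled solutions labeled with token counts and correctness; the per-question mean-length gap below is a generic statistic, not the papers' exact protocol:

```python
from statistics import mean

# Hypothetical records: one entry per sampled solution. Each holds the
# question id, the token count of the full reasoning trace, and whether
# the final answer was correct.
records = [
    {"qid": 1, "tokens": 412,  "correct": True},
    {"qid": 1, "tokens": 958,  "correct": False},
    {"qid": 2, "tokens": 530,  "correct": True},
    {"qid": 2, "tokens": 1344, "correct": False},
]

def per_question_length_gap(records):
    """Mean tokens of incorrect minus correct traces, averaged over
    questions that have at least one of each outcome."""
    gaps = []
    for qid in {r["qid"] for r in records}:
        right = [r["tokens"] for r in records if r["qid"] == qid and r["correct"]]
        wrong = [r["tokens"] for r in records if r["qid"] == qid and not r["correct"]]
        if right and wrong:  # only questions with both outcomes are comparable
            gaps.append(mean(wrong) - mean(right))
    return mean(gaps)

# A positive gap reproduces the finding: incorrect traces run longer.
print(per_question_length_gap(records))  # 680.0 on the toy data
```

Matching correct and incorrect traces within the same question is what makes the statistic meaningful; pooling across questions would confound length with question difficulty.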
The explanation: longer CoTs contain more self-revisions (see Does self-revision actually improve reasoning in language models?). The model overshoots, revises, introduces errors, and compounds them through revision chains. A model that gets to the right answer quickly does so because it's reasoning correctly, not because it failed to second-guess itself.
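One way to probe this mechanism is to count revision cues per trace and check that they track both length and wrongness. A rough sketch; the marker list is illustrative, not the studies' exact lexicon:

```python
import re

# Cue words that typically open a self-revision step in o1-style traces.
# Illustrative lexicon only; the cited work may segment traces differently.
REVISION_MARKERS = ("wait", "alternatively", "hmm", "let me reconsider")

def revision_count(trace: str) -> int:
    """Count revision cues at word boundaries, case-insensitively."""
    text = trace.lower()
    return sum(len(re.findall(rf"\b{re.escape(m)}\b", text))
               for m in REVISION_MARKERS)

trace = "So x = 2. Wait, that fails the first equation. Alternatively, try x = 3."
print(revision_count(trace))  # 2
```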
The practical implication is that trace length is a poor quality signal — and that training/inference strategies optimizing for longer traces may be optimizing in the wrong direction.
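One cheap way to act on the signal at inference time is to bias answer aggregation toward shorter traces rather than extending them. A sketch of a length-weighted vote over N sampled solutions; the inverse-length weighting is an assumed illustration, not a published recipe:

```python
from collections import defaultdict

def length_weighted_vote(samples):
    """Pick the answer supported by frequent *and* short traces.
    `samples` is a list of (final_answer, trace_token_count) pairs from
    N independent generations for the same question."""
    score = defaultdict(float)
    for answer, tokens in samples:
        score[answer] += 1.0 / max(tokens, 1)  # shorter traces count more
    return max(score, key=score.get)

samples = [("42", 350), ("42", 410), ("17", 1200), ("17", 1150), ("17", 990)]
# Plain majority vote would return "17"; weighting by inverse length
# favors the answer reached by the short traces instead.
print(length_weighted_vote(samples))  # "42"
```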
The LLM Strategic Reasoning paper (a behavioral game theory evaluation of 22 LLMs) provides independent cross-domain confirmation. The top performers (GPT-o1, DeepSeek-R1) produce the shortest CoT in the games they play best. In competitive games, DeepSeek-R1 exhibits "repeated self-doubt in its CoT" that creates redundant reasoning loops, inflating token usage without improving outcomes. The pattern extends beyond math and coding to strategic interaction: across game types, longer chains signal hesitation and uncertainty, not deeper insight. See Do large language models use one reasoning style or many?.
GaslightingBench-R adds a further dimension: manipulative multi-turn prompts exploit exactly this vulnerability. By introducing misleading content into the chain, adversarial prompts extend the reasoning trace through corrupted steps, and the model's own reasoning then elaborates those corrupted steps into longer wrong answers. The same length-wrongness correlation holds, but now as a designed attack surface: longer chains are more exposed to manipulation because they offer more points of intervention. See Why do reasoning models fail under manipulative prompts?.
Source: Test Time Compute
Related concepts in this collection
- Does self-revision actually improve reasoning in language models? When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability. (the mechanism behind this finding)
- Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive. (the broader overthinking phenomenon)
- Do hedging markers actually signal careful thinking in AI? Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty. This matters because users often interpret such language as a sign of trustworthy reasoning. (supporting evidence from linguistic analysis)
- Why do reasoning models fail under manipulative prompts? Explores whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics. (adversarial exploitation of the length-wrongness correlation)
- Do large language models use one reasoning style or many? Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios. (independent confirmation from game theory: leaders produce the shortest CoT in their strongest games)
- Why do reasoning models struggle with theory of mind tasks? Extended reasoning training helps with math and coding but not social cognition. Explores whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning. (the inverse-length pattern breaks in social reasoning)
- Can we measure how deeply a model actually reasons? What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching? (DTR explains the mechanism: correct traces contain a higher proportion of deep-thinking tokens, i.e. genuine computation, and less low-DTR padding. Models also produce longer traces for theory-of-mind questions than for factual ones, yet that effort is uncorrelated with accuracy, suggesting the shorter-is-correct heuristic is specific to formal reasoning. A rough sketch of the layer-shift idea appears after this list.)
- Does longer reasoning actually mean harder problems? Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics. (refines the mechanism further: trace length reflects how close the prompt is to the training distribution, not how hard the problem is. "Correct = shorter" partly recodes "in-distribution = shorter = more likely correct"; the length signal is a training-proximity signal, not purely a reasoning-quality signal.)
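For the DTR entry above, a rough logit-lens sketch of the layer-shift idea: project an intermediate layer's hidden states through the final norm and LM head, then measure how far the final next-token distribution moves away from that early read. The per-token KL below is a guess at the metric's spirit, not its published definition:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model; the cited work presumably targets o1-like models.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("Wait, let me reconsider the previous step.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit lens: read a middle layer through the final LayerNorm + LM head.
mid = len(out.hidden_states) // 2
mid_logits = model.lm_head(model.transformer.ln_f(out.hidden_states[mid]))

log_p_mid = torch.log_softmax(mid_logits, dim=-1)
log_p_final = torch.log_softmax(out.logits, dim=-1)

# Per-token KL(final || mid): large values mean late layers substantially
# revise the prediction, a crude proxy for "deep thinking" on that token.
kl = torch.sum(log_p_final.exp() * (log_p_final - log_p_mid), dim=-1).squeeze(0)
print([round(v, 3) for v in kl.tolist()])
```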
Original note title: correct reasoning traces in o1-like models are shorter than incorrect ones