
Why do correct reasoning traces contain fewer tokens?

In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges the assumption that longer reasoning traces indicate better reasoning, and raises the question of what length actually signals.

Note · 2026-02-20 · sourced from Test Time Compute

A counterintuitive empirical finding: when comparing correct vs. incorrect solutions to the same questions in o1-like models (QwQ, DeepSeek-R1, LIMO), the correct solutions are systematically shorter. More tokens correlate with wrongness, not rightness.
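A minimal sketch of how that comparison can be run, assuming per-question samples labeled with token counts and correctness (the records and field names here are hypothetical, not from the cited evaluations):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: one entry per sampled solution, with the question id,
# the trace length in tokens, and whether the final answer was correct.
samples = [
    {"qid": "q1", "tokens": 512,  "correct": True},
    {"qid": "q1", "tokens": 1490, "correct": False},
    {"qid": "q2", "tokens": 640,  "correct": True},
    {"qid": "q2", "tokens": 2050, "correct": False},
]

# Group by question so lengths are compared within the same problem, not
# confounded by some problems simply needing longer answers than others.
by_qid = defaultdict(lambda: {True: [], False: []})
for s in samples:
    by_qid[s["qid"]][s["correct"]].append(s["tokens"])

# For each question with both outcomes, check whether the mean correct
# trace is shorter than the mean incorrect trace.
wins = [
    mean(g[True]) < mean(g[False])
    for g in by_qid.values()
    if g[True] and g[False]
]
print(f"correct shorter on {sum(wins)}/{len(wins)} questions")
```

The within-question grouping is the important design choice: comparing raw lengths across questions would mix the length-wrongness signal with problem difficulty.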

This directly challenges the "longer = better" narrative underlying much of the test-time scaling literature. If scaling compute leads to longer traces, and longer traces are more likely to be incorrect, then compute scaling via trace extension is actively selecting for worse outputs.

The explanation: longer CoTs contain more self-revisions (see Does self-revision actually improve reasoning in language models?). The model overshoots, revises, introduces errors, and compounds them through revision chains. A model that gets to the right answer quickly does so because it's reasoning correctly, not because it failed to second-guess itself.
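The mechanism can be made concrete with a toy Monte Carlo sketch, where every probability is invented for illustration: each self-revision adds tokens and is more likely to break a correct answer than to rescue a wrong one.

```python
import random

random.seed(0)

def simulate_trace(p_correct=0.7, p_break=0.3, p_fix=0.1,
                   base_tokens=300, revision_tokens=200):
    """One trace: an initial answer plus a random number of self-revisions."""
    correct = random.random() < p_correct
    n_revisions = random.randint(0, 5)
    for _ in range(n_revisions):
        if correct:
            correct = random.random() >= p_break  # a revision can break a right answer...
        else:
            correct = random.random() < p_fix     # ...and only rarely rescues a wrong one
    return base_tokens + n_revisions * revision_tokens, correct

runs = [simulate_trace() for _ in range(10_000)]

def mean_len(ok: bool) -> float:
    lens = [tokens for tokens, correct in runs if correct == ok]
    return sum(lens) / len(lens)

print(f"mean tokens | correct:   {mean_len(True):.0f}")
print(f"mean tokens | incorrect: {mean_len(False):.0f}")
```

With p_break greater than p_fix, incorrect traces come out longer on average, reproducing the length-wrongness correlation without any appeal to problem difficulty.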

The practical implication is that trace length is a poor quality signal — and that training/inference strategies optimizing for longer traces may be optimizing in the wrong direction.
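One way to act on that implication at inference time, sketched under the assumption that you already sample n traces per question (this selection rule is an illustration, not a method from the source): keep majority voting as the primary signal, and use trace length only to break ties toward shorter traces.

```python
from collections import Counter

def select_answer(samples: list[tuple[str, int]]) -> str:
    """Majority vote over final answers; ties break toward the shortest trace.

    `samples` holds (final_answer, trace_tokens) pairs from n independent
    generations. Length is used only as a tie-breaker, since shorter traces
    are empirically more likely to be correct.
    """
    votes = Counter(answer for answer, _ in samples)
    top = max(votes.values())
    finalists = {a for a, v in votes.items() if v == top}
    # Among tied answers, prefer the one supported by the shortest trace.
    return min(
        (min(tokens for ans, tokens in samples if ans == a), a)
        for a in finalists
    )[1]

# Two answers tied 2-2; "42" wins because its shortest supporting trace
# (650 tokens) beats the shortest trace supporting "41" (700 tokens).
print(select_answer([("42", 900), ("41", 2400), ("42", 650), ("41", 700)]))
```

Using length only as a tie-breaker keeps the heuristic conservative: it never overrides a clear majority, it only resolves ambiguity in the direction the correlation suggests.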

The LLM Strategic Reasoning paper (a behavioral game theory evaluation of 22 LLMs) provides independent cross-domain confirmation. In competitive games, the top performers (GPT-o1, DeepSeek-R1) produce the shortest CoT within their strongest games; DeepSeek-R1's competitive-game CoT exhibits "repeated self-doubt" that creates redundant reasoning loops, inflating token usage without improving play. The pattern extends beyond math and coding to strategic interaction: across game types, longer chains signal hesitation and uncertainty, not deeper insight. See Do large language models use one reasoning style or many?.

GaslightingBench-R adds a further dimension: manipulative multi-turn prompts exploit exactly this vulnerability. By injecting misleading content into the chain, adversarial prompts extend the reasoning trace through corrupted steps, and the model's own reasoning then elaborates those steps into longer wrong answers. The same length-wrongness correlation holds, but now as a designed attack surface: longer chains are more exposed to manipulation because they offer more points of intervention. Why do reasoning models fail under manipulative prompts? documents this adversarial angle.


Source: Test Time Compute
