Does longer reasoning actually mean harder problems?
Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics.
A prevailing assumption: longer reasoning traces indicate more thinking effort, so more complex problems should produce longer traces. Controlled experiments undercut this assumption.
Training transformer models from scratch on derivational traces of the A* search algorithm — where problem complexity is precisely controllable and verifiable — reveals the decoupling:
- On in-distribution problems, trace length shows some alignment with difficulty
- On trivially simple problems (free-space mazes without obstacles), models often produce excessively long traces and sometimes fail to produce solutions
- On out-of-distribution problems, trace length and complexity become entirely decoupled — no correlation
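The decoupling above is a statement about rank correlation between problem difficulty and trace length. A minimal sketch of how one might measure it, using a hand-rolled Spearman correlation; all the difficulty and trace-length numbers below are invented for illustration, not taken from the experiments:

```python
def rank(values):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Invented in-distribution data: trace length loosely tracks difficulty
# (difficulty = optimal solution cost, length = intermediate tokens).
in_dist_difficulty = [3, 5, 8, 12, 15, 20]
in_dist_trace_len = [40, 55, 90, 130, 160, 210]

# Invented out-of-distribution data: length is arbitrary w.r.t. difficulty.
ood_difficulty = [3, 5, 8, 12, 15, 20]
ood_trace_len = [300, 80, 250, 60, 400, 90]

print(round(spearman(in_dist_difficulty, in_dist_trace_len), 2))  # high (near 1)
print(round(spearman(ood_difficulty, ood_trace_len), 2))          # near 0
```

Rank correlation is the natural choice here because it is insensitive to the (unknown, likely nonlinear) mapping from search effort to token count.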
The interpretation: intermediate token sequence length reflects approximate recall from the training distribution, not problem-adaptive computation. When a problem is close to training examples, the model retrieves a matching schema whose length reflects the training data's length distribution for that problem type. When a problem is far from training, the model has no calibrated schema to retrieve — trace length becomes arbitrary.
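This retrieval interpretation can be made concrete with a toy model (my own construction, not from the source): if trace length is sampled from the length distribution of the nearest training schema, then two problems of identical difficulty can receive very different trace lengths simply because they match different schemas. The schema names and statistics below are hypothetical:

```python
import random

random.seed(0)  # deterministic for illustration

# Hypothetical training length statistics per problem "type" (schema):
# (mean trace tokens, std dev) observed during training.
schema_length_stats = {
    "small_maze": (50, 5),
    "large_maze": (200, 20),
}

def nearest_schema(problem_size):
    """Crude retrieval: match the problem to a schema by surface size."""
    return "small_maze" if problem_size < 15 else "large_maze"

def recalled_trace_length(problem_size):
    """Sample a trace length from the matched schema's training distribution."""
    mean, std = schema_length_stats[nearest_schema(problem_size)]
    return max(1, round(random.gauss(mean, std)))

# Two free-space mazes, both trivially easy (no obstacles). The recalled
# lengths still differ, because length follows the matched schema's
# training statistics, not the actual work the problem requires.
print(recalled_trace_length(5))   # near 50 tokens
print(recalled_trace_length(30))  # near 200 tokens
```

Under this model, length calibration degrades exactly as the paragraph above describes: a problem far from every training schema gets a length drawn from whatever schema the crude match lands on, which is arbitrary with respect to the problem itself.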
This challenges the entire anthropomorphic framing of "thinking time." When DeepSeek-R1 or similar models produce long chains, the conventional interpretation is that the problem is hard and the model is "working through it." The A* evidence suggests the length may primarily indicate how close the problem is to training distribution, not how much genuine computation is occurring.
The practical implication: trace length is not a reliable proxy for problem difficulty. Length-based scaling heuristics (add more tokens for harder problems) may be calibrating to the wrong signal. The related note Does more thinking time always improve reasoning accuracy? supports this: more tokens do not reliably help beyond a certain point.
This also deepens the question raised in Does chain-of-thought reasoning reveal genuine inference or pattern matching?: if trace length reflects training-distribution proximity, then even the amount of imitation is calibrated to training similarity rather than to actual inferential needs.
Source: Reasoning Critiques
Related concepts in this collection
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges the assumption that longer reasoning traces indicate better reasoning and raises the question of what length actually signals.
  Connection: the within-distribution case. Correct traces are shorter because the model found the right schema quickly; this note explains the mechanism.
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or whether there is a point beyond which additional reasoning becomes counterproductive.
  Connection: the practical consequence. Tokens past the threshold reflect distribution mismatch, not useful computation.
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  Connection: trace length is another dimension of imitation, reflecting how much training data resembles the problem at hand.
- Does extended thinking actually improve reasoning or just increase variance?
  When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
  Connection: complementary. Extended thinking broadens the output distribution rather than improving reasoning quality; trace length is part of this variance.
Original note title: CoT trace length reflects training distribution proximity, not problem difficulty