Why does chain-of-thought accuracy eventually decline with length?
Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.
The "longer is better" assumption for CoT has an empirical ceiling: task accuracy initially improves with CoT length, reaches a peak, then decreases. The inverted-U curve applies across models and tasks, and its peak location follows consistent patterns.
Two scaling laws for optimal CoT length:
- Difficulty scaling — optimal length increases with task difficulty. Harder problems benefit from longer chains because more decomposition steps are needed. This part matches intuition.
- Capability scaling — optimal length decreases with model capability. More capable models find more efficient paths to correct answers and require fewer steps. Using the same long chains for a more capable model is counterproductive.
The second law has a practical consequence: treating all models identically (same token budget, same chain length) misallocates compute. A model that can solve a problem in 5 steps should not be given budgets designed for a 20-step solution.
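Both scaling laws fall out of the same toy model. A brute-force sweep over chain lengths (again illustrative; EPS, the quadratic difficulty penalty, and the difficulty and capability values are assumptions, not fitted quantities) shows the optimum moving right with difficulty and left with capability:

```python
import math

EPS = 0.05  # assumed per-step slip rate, as in the sketch above

def chain_accuracy(difficulty: float, capability: float, n: int) -> float:
    """Same toy model: n equal subtasks, every step must succeed."""
    per_step = (1 - EPS) * math.exp(-((difficulty / n) / capability) ** 2)
    return per_step ** n

def optimal_length(difficulty: float, capability: float, max_n: int = 256) -> int:
    """Brute-force the accuracy-maximizing number of steps."""
    return max(range(1, max_n + 1),
               key=lambda n: chain_accuracy(difficulty, capability, n))

if __name__ == "__main__":
    # Difficulty scaling: harder tasks push the optimum to longer chains.
    for d in (2.0, 4.0, 8.0):
        print(f"difficulty {d}: optimal length {optimal_length(d, capability=1.0)}")
    # Capability scaling: stronger models pull the optimum to shorter chains.
    for c in (0.5, 1.0, 2.0):
        print(f"capability {c}: optimal length {optimal_length(4.0, capability=c)}")
```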
Simplicity bias as a training-emergent property: RL training reveals this dynamic in action. As RL training improves accuracy, models gravitate toward shorter CoTs — not because they were explicitly trained to be concise, but because shorter chains produce correct answers and RL rewards correct answers. The simplicity bias emerges automatically from the reward signal.
This connects to Why do correct reasoning traces contain fewer tokens? — the same empirical signal: correct chains tend to be shorter. The inverted-U explains why: length past the optimal point accumulates decomposition errors and contextual noise (see Do prior errors in context history amplify future errors?).
The practical implication: train on optimal-length CoTs rather than maximal-length ones, and at inference, use length-aware filtering to discard excessively long chains. The simplicity bias is not a failure mode — it is a signal of genuine capability.
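A minimal sketch of what length-aware filtering could look like at inference time (the keep_fraction heuristic and the sample data are hypothetical, not a published recipe): keep only the shorter sampled chains, then majority-vote over their answers.

```python
from collections import Counter

def length_filtered_vote(samples: list[tuple[str, int]],
                         keep_fraction: float = 0.5) -> str:
    """Majority vote over the shorter chains only.

    samples: (answer, chain_length_in_tokens) pairs from repeated sampling.
    keep_fraction: fraction of shortest chains to keep (a tunable assumption).
    """
    n_keep = max(1, int(len(samples) * keep_fraction))
    kept = sorted(samples, key=lambda s: s[1])[:n_keep]
    return Counter(answer for answer, _ in kept).most_common(1)[0][0]

if __name__ == "__main__":
    # Hypothetical samples: the overlong chains drift to a wrong answer.
    samples = [("42", 180), ("42", 210), ("42", 250),
               ("41", 900), ("41", 1100), ("42", 300)]
    print(length_filtered_vote(samples))  # -> "42"
```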
Source: Reasoning Critiques
Related concepts in this collection
- Why do correct reasoning traces contain fewer tokens? In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals. Relation: the empirical observation; this note provides the theoretical model explaining it.
- Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or whether there is a point beyond which additional reasoning becomes counterproductive. Relation: the threshold is not fixed; this note shows it is a function of task difficulty and model capability.
- Does extended thinking actually improve reasoning or just increase variance? When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability. Relation: past the optimal length, variance inflation dominates over quality improvement.
- Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth. Relation: parallel approaches avoid the problem by distributing tokens across independent chains rather than extending one chain past its optimum.
- Can minimal reasoning chains match full explanations? Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained. Relation: empirical operationalization; CoD demonstrates that capable models can achieve full accuracy at 7.6% of standard CoT length, matching the inverted-U prediction that more capable models prefer dramatically shorter chains; the 92.4% of tokens removed were on the declining side of the curve.
- Does gradually tightening token budgets beat fixed budget training? Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise. Relation: training-time implementation; the generous-to-tight curriculum naturally navigates the inverted-U by allowing exploration of the full curve during early training, then compressing to the optimal point; models discover the peak with generous budgets and descend toward conciseness under tightening constraints.
Original note title: optimal CoT length follows an inverted-U — more capable models prefer shorter CoT