Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Why does chain of thought accuracy eventually decline with length?

Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.

Note · 2026-02-22 · sourced from Reasoning Critiques
How should we allocate compute budget at inference time?

The "longer is better" assumption for CoT has an empirical ceiling: task accuracy initially improves with CoT length, reaches a peak, then declines. This inverted-U curve holds across models and tasks, and the location of its peak follows two consistent scaling patterns.

Two scaling laws for optimal CoT length:

  1. Difficulty scaling — optimal length increases with task difficulty. Harder problems benefit from longer chains because more decomposition steps are needed. This part matches intuition.

  2. Capability scaling — optimal length decreases with model capability. More capable models find more efficient paths to correct answers and require fewer steps. Using the same long chains for a more capable model is counterproductive.
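The two laws above can be condensed into a toy function. The functional form `L* = c * d / k` and every constant in it are assumptions chosen only to exhibit the two monotonic trends, not a fitted law:

```python
# Toy model of the two scaling laws. The form L* = c * difficulty / capability
# is an illustrative assumption, not an empirical fit; units are arbitrary.

def optimal_cot_length(difficulty: float, capability: float, c: float = 10.0) -> int:
    """Hypothetical optimal chain length: grows with task difficulty,
    shrinks with model capability."""
    return max(1, round(c * difficulty / capability))

# Difficulty scaling: harder task -> longer optimal chain (capability fixed).
assert optimal_cot_length(4.0, 1.0) > optimal_cot_length(2.0, 1.0)
# Capability scaling: stronger model -> shorter optimal chain (difficulty fixed).
assert optimal_cot_length(4.0, 2.0) < optimal_cot_length(4.0, 1.0)
```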

The second law has a practical consequence: treating all models identically (same token budget, same chain length) misallocates compute. A model that can solve a problem in 5 steps should not be given budgets designed for a 20-step solution.
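One way to act on this is to size each model's token budget from an empirical estimate of its optimal step count, rather than handing every model the same cap. Everything below (model names, step counts, tokens-per-step) is hypothetical; in practice the step counts would come from sweeping CoT length on a validation set:

```python
# Hedged sketch of capability-aware inference budgets. All values are
# illustrative stand-ins for validation-set estimates.

CALIBRATED_OPTIMAL_STEPS = {          # hypothetical per-(model, difficulty) estimates
    ("small-model", "hard"): 20,
    ("large-model", "hard"): 5,
}
TOKENS_PER_STEP = 60                  # rough average tokens per reasoning step (assumed)

def max_tokens(model: str, difficulty: str) -> int:
    """Per-model generation cap sized to that model's optimal chain length."""
    return CALIBRATED_OPTIMAL_STEPS[(model, difficulty)] * TOKENS_PER_STEP

# A uniform 20-step budget would hand the stronger model 4x the tokens it needs.
assert max_tokens("large-model", "hard") < max_tokens("small-model", "hard")
```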

Simplicity bias as a training-emergent property: RL training surfaces this dynamic directly. As RL training improves accuracy, models gravitate toward shorter CoTs — not because they were explicitly trained to be concise, but because shorter chains produce correct answers and RL rewards correct answers. The simplicity bias emerges automatically from the reward signal.
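A minimal simulation of how that bias can emerge: the reward below scores only correctness, never length, yet a multiplicative reward-weighted update concentrates probability mass on short chains once short chains are (by assumption) more often correct. The per-length accuracies and the update rule are illustrative, not taken from any real training run:

```python
import random

random.seed(0)

# "Policy" = categorical distribution over chain lengths. Assumption (the
# note's premise): shorter chains succeed more often. The reward is 1 for a
# correct answer and 0 otherwise; length is never mentioned in the reward.
lengths = [5, 10, 20]
p_correct = {5: 0.8, 10: 0.6, 20: 0.4}   # hypothetical accuracy per length
weights = {L: 1.0 for L in lengths}       # unnormalized policy weights

def sample_length() -> int:
    """Sample a chain length proportionally to current policy weights."""
    r = random.uniform(0, sum(weights.values()))
    for L in lengths:
        r -= weights[L]
        if r <= 0:
            return L
    return lengths[-1]

for _ in range(5000):
    L = sample_length()
    reward = 1.0 if random.random() < p_correct[L] else 0.0
    # Reward-weighted update: upweight a length when it earned reward.
    weights[L] *= 1.0 + 0.01 * (reward - 0.5)

# Mass concentrates on the shortest chains, purely via the correctness reward.
assert max(weights, key=weights.get) == 5
```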

This connects to Why do correct reasoning traces contain fewer tokens? — the same empirical signal: shorter chains are correct chains. The inverted-U explains why: past the optimal point, additional steps accumulate decomposition errors and contextual noise (see Do prior errors in context history amplify future errors?).

The practical implication: train on CoTs of optimal length (not maximal length), and at inference, use length-aware filtering to discard excessively long chains. The simplicity bias is not a failure mode — it is a signal of genuine capability.
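The inference-time filtering step can be sketched as rejection before a majority vote: sample several chains, drop those far past the length cap, and vote over the survivors. The `samples` data and the cap are made-up values, and the `(num_steps, answer)` pairs stand in for real model outputs:

```python
from collections import Counter

def length_filtered_vote(chains, max_len):
    """Majority-vote over sampled chains, discarding excessively long ones.

    chains: list of (num_steps, answer) pairs from repeated sampling.
    """
    kept = [ans for steps, ans in chains if steps <= max_len]
    if not kept:                      # fall back if the filter removes everything
        kept = [ans for _, ans in chains]
    return Counter(kept).most_common(1)[0][0]

# Hypothetical samples: the short chains agree on "42", the long ones on "17".
samples = [(4, "42"), (5, "42"), (23, "17"), (30, "17"), (6, "42"), (28, "17"), (31, "17")]

# An unfiltered vote would pick the long chains' answer; capping length flips it.
assert length_filtered_vote(samples, max_len=1000) == "17"
assert length_filtered_vote(samples, max_len=10) == "42"
```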

