When More is Less: Understanding Chain-of-Thought Length in LLMs

Paper · arXiv 2502.07266 · Published February 11, 2025
Reasoning Critiques · Reinforcement Learning · Evaluations

Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length: performance initially improves but eventually declines as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias whereby more capable models favor shorter, more efficient CoT reasoning. This bias also appears in Reinforcement Learning (RL) training, where models gravitate towards shorter CoTs as their accuracy improves. To understand these dynamics more deeply, we establish a simple theoretical model that formally accounts for these phenomena, including the scaling laws of the optimal length and the emergence of simplicity bias during RL. Guided by this framework, we demonstrate significant practical benefits from training with optimally-lengthed CoTs and from employing length-aware filtering at inference.
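
The inverted U-shape and the scaling of the optimal length can be illustrated with a small toy calculation. The Python sketch below is not the paper's formal model; it simply assumes that a task of total difficulty T is split into n equally hard subtasks, that each step succeeds with probability (1 − ε)·exp(−(d/s)²) for subtask difficulty d = T/n, a capability parameter s, and an irreducible per-step noise ε, and that the final answer is correct only if every step succeeds. All parameter names and values are hypothetical.

```python
import numpy as np

def accuracy(n_steps, difficulty, capability, step_noise):
    """Probability of a fully correct CoT when a task of total `difficulty`
    is split into `n_steps` subtasks of difficulty `difficulty / n_steps`.
    Each step succeeds with prob (1 - step_noise) * exp(-(d / capability)^2),
    and the final answer is correct only if every step succeeds."""
    d = difficulty / n_steps
    per_step = (1.0 - step_noise) * np.exp(-((d / capability) ** 2))
    return per_step ** n_steps

steps = np.arange(1, 61)
for difficulty, capability in [(8.0, 3.0), (12.0, 3.0), (8.0, 4.0)]:
    acc = accuracy(steps, difficulty, capability, step_noise=0.05)
    print(f"T={difficulty}, s={capability}: "
          f"optimal #steps={steps[np.argmax(acc)]}, best accuracy={acc.max():.3f}")
```

Under this toy model the accuracy-maximizing number of steps works out to roughly T / (s·√(−ln(1 − ε))), so it grows with task difficulty and shrinks with model capability, mirroring the scaling behaviors described above.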

Intuitively, decomposing a task into more steps yields easier subtasks, but errors also accumulate exponentially with the number of steps, producing an optimal tradeoff at an intermediate CoT length. Notably, this theory also explains the emergence of the simplicity bias observed during RL training. Thus, although simple, our theory provides a valuable characterization of LLM behavior during CoT reasoning. Translating this understanding into practice, we show significant benefits from training with optimally-lengthed CoTs and from employing Length-aware Vote to filter out excessively long CoTs at inference.
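
As one concrete reading of such length-aware filtering, the sketch below samples several CoTs per question, keeps only the shorter ones, and takes a majority vote over their answers. The function name `length_aware_vote`, the `keep_ratio` parameter, and the example data are hypothetical; the paper's exact Length-aware Vote procedure may differ.

```python
from collections import Counter

def length_aware_vote(samples, keep_ratio=0.5):
    """Majority vote over sampled CoTs after discarding the longest ones.

    `samples` is a list of (answer, num_cot_steps) pairs from repeated
    sampling; we keep the `keep_ratio` fraction with the shortest CoTs
    and take a majority vote over their final answers."""
    ranked = sorted(samples, key=lambda s: s[1])            # shortest CoTs first
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]  # drop the longest ones
    votes = Counter(answer for answer, _ in kept)
    return votes.most_common(1)[0][0]

# Hypothetical usage: six sampled solutions and their CoT step counts.
samples = [("42", 4), ("42", 5), ("41", 12), ("42", 6), ("37", 15), ("41", 14)]
print(length_aware_vote(samples))  # excessively long CoTs are filtered out -> "42"
```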