Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs
Large language models (LLMs) are increasingly optimized for long reasoning, under the assumption that more reasoning leads to better performance. However, emerging evidence suggests that longer responses can sometimes degrade accuracy rather than improve it. In this paper, we conduct a systematic empirical study of the relationship between reasoning length and answer correctness. We find that LLMs tend to overthink simple problems—generating unnecessarily long outputs—and underthink harder ones, failing to extend their reasoning when it is most needed, indicating that models may misjudge problem difficulty and fail to calibrate their response length appropriately. Furthermore, we investigate the effect of length reduction via preference optimization, where shorter responses are simply preferred regardless of answer correctness. Experiments show that generation length can be significantly reduced while maintaining acceptable accuracy.
• For a fixed question, accuracy exhibits a non-monotonic relationship with reasoning length: it tends to increase with longer responses up to a point, but declines when the reasoning becomes excessively long.
• Questions that are answered incorrectly tend to have longer average reasoning lengths than correctly answered ones, likely because they are inherently more complex and require more reasoning steps. However, models do not consistently recognize or adapt to problem difficulty. For relatively easy questions, models can often detect modest increases in difficulty and extend their reasoning appropriately. In contrast, when encountering problems beyond their capabilities, models sometimes underthink—either failing to recognize the increased difficulty or lacking the knowledge needed to respond effectively. As a result, they often generate responses that are shorter than what a correct answer would require.
Building on our analysis, we further investigate the effect of length-based preference optimization. Specifically, we explore whether encouraging the model to produce shorter responses—without directly optimizing for correctness—can still preserve answer accuracy. Using only unlabeled data, we fine-tune the model to prefer shorter responses and find that this preference tuning maintains relatively strong accuracy while reducing token length, even in the absence of ground-truth supervision. The reduction in token length is driven largely by shorter incorrect responses, which are significantly longer than correct ones and thus easier to compress; however, we also observe a 10%–25% reduction in the length of correct responses. This suggests that while the model can recognize question difficulty and adjust its generation length accordingly for easier problems, a tendency toward overthinking persists.
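This excerpt does not specify the exact preference optimization algorithm used, so the sketch below is only an illustrative instantiation, not our released implementation. It assumes a DPO-style objective over pairs built purely from response length: several responses are sampled per unlabeled prompt, the shortest is treated as "chosen" and the longest as "rejected", and sequence-level log-probabilities are assumed to be available from the policy and a frozen reference model. The helper names (build_length_preference_pairs, dpo_loss) and the value of beta are hypothetical.

```python
import torch
import torch.nn.functional as F

def build_length_preference_pairs(samples):
    """Given several sampled responses per prompt (no gold answers needed),
    prefer the shortest response and reject the longest one."""
    pairs = []
    for prompt, responses in samples.items():
        ranked = sorted(responses, key=len)  # shortest first
        if len(ranked) >= 2:
            pairs.append({"prompt": prompt,
                          "chosen": ranked[0],     # shortest = preferred
                          "rejected": ranked[-1]}) # longest = dispreferred
    return pairs

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style objective: push the policy to favor the (shorter) chosen
    response relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

if __name__ == "__main__":
    # Toy usage with dummy sequence log-probabilities.
    pairs = build_length_preference_pairs(
        {"Q1": ["Short final answer.",
                "A much longer chain of thought that rechecks every step ..."]})
    print(pairs)
    loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-40.0]),
                    torch.tensor([-12.0]), torch.tensor([-38.0]))
    print(loss.item())
```

Because the pairing rule never consults answer correctness, this setup matches the unlabeled-data setting described above: the only supervision signal is relative response length.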