Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending up with redundant and ineffective thinking. This newly introduced scenario exacerbates the general overthinking issue to a large extent, which we name MiP-Overthinking. Such failures run counter to the “test-time scaling law” but have been widely observed on multiple datasets we curated with MiP, indicating the harm of cheap overthinking and a lack of critical thinking. Surprisingly, LLMs not specifically trained for reasoning perform much better in the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw in the current training recipe for reasoning LLMs, which does not adequately encourage efficient thinking, leading to the abuse of thinking patterns. To further investigate the reasons behind such failures, we conduct fine-grained analyses of the reasoning length, overthinking patterns, and location of critical thinking across different types of LLMs. Moreover, our extended ablation study reveals that overthinking is contagious through the distillation of reasoning models’ responses. These results improve the understanding of overthinking and offer novel insights into mitigating the problem.
This phenomenon indicates that although most existing reasoning models possess thinking and reasoning capabilities to some extent, they lack the critical thinking needed to “reject” ill-posed questions. By contrast, non-reasoning models, though not explicitly trained for reasoning, tend to strike a better balance, generating shorter answers that are more likely to acknowledge the MiP when the question is ill-posed. This gap reveals a surprising contradiction to the test-time scaling law.
Specifically, occurrences of the keywords alternatively, wait, check, and but can be directly counted from the model responses, including the thinking paths of reasoning models. The Hypothesis category covers several keywords, including perhaps, maybe, and might. Step represents the step count, obtained by splitting the response on \n\n.
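As a concrete illustration, a minimal sketch of this counting procedure is given below. The keyword groups follow the categories described above, while the lowercasing, word-boundary matching, and the helper name count_overthinking_patterns are our own illustrative assumptions rather than the paper's exact implementation.

```python
import re

# Keyword categories as described in the text; exact matching rules are assumptions.
PATTERN_KEYWORDS = {
    "alternatively": ["alternatively"],
    "wait": ["wait"],
    "check": ["check"],
    "but": ["but"],
    "hypothesis": ["perhaps", "maybe", "might"],
}

def count_overthinking_patterns(response: str) -> dict:
    """Count keyword occurrences and reasoning steps in a single model response.

    The response may include the thinking path of a reasoning model.
    Steps are non-empty chunks separated by blank lines ("\n\n"), as described above.
    """
    text = response.lower()
    counts = {
        category: sum(len(re.findall(rf"\b{re.escape(kw)}\b", text)) for kw in keywords)
        for category, keywords in PATTERN_KEYWORDS.items()
    }
    counts["step"] = len([chunk for chunk in response.split("\n\n") if chunk.strip()])
    return counts

# Toy usage example:
if __name__ == "__main__":
    demo = ("Let me check the premise.\n\n"
            "Wait, the speed is missing.\n\n"
            "Maybe the question is ill-posed.")
    print(count_overthinking_patterns(demo))
```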
Moreover, when comparing the change in step count, reasoning models exhibit a large increase for MiP questions, while non-reasoning models typically show fewer steps, suggesting they quickly conclude that the question is unanswerable. Given this gap, together with the consistently higher abstain rates of the non-reasoning models, we conclude that the lengthy reasoning steps of reasoning models are mostly redundant and indicative of self-doubting thinking patterns.