Why do reasoning models overthink ill-posed questions?
Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
The standard case for reasoning models: they think more, therefore they reason better. The missing-premise case inverts this completely.
When given questions with missing premises (MiP) — questions that are unanswerable because they lack necessary information — reasoning models produce responses that are drastically longer than for normal questions. The additional length is not useful thinking. It is redundant self-doubt: the model cycles through "alternatively," "wait," "check," and "but" without making progress, unable to resolve the contradiction introduced by the missing premise.
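As a rough illustration (a heuristic invented here, not a metric from the source work), the pattern can be made measurable by counting those self-doubt markers per unit of trace length and comparing a missing-premise variant against a well-posed control:

```python
import re

# Self-doubt markers the note calls out: "alternatively", "wait", "check", "but".
# Illustrative heuristic only; not the measurement used in the MiP study.
RUMINATION_MARKERS = re.compile(r"\b(alternatively|wait|check|but)\b", re.IGNORECASE)

def rumination_score(reasoning_trace: str) -> float:
    """Self-doubt markers per 100 words of the model's thinking trace."""
    words = reasoning_trace.split()
    if not words:
        return 0.0
    hits = len(RUMINATION_MARKERS.findall(reasoning_trace))
    return 100.0 * hits / len(words)

# Usage: compare traces for well-posed vs. missing-premise (MiP) variants of the
# same question; the MiP trace is expected to be both longer and higher-scoring.
well_posed = "The triangle has base 4 and height 3, so the area is 6."
mip = ("The base is 4, but the height is missing. Wait, maybe assume 3? "
       "Alternatively, check the figure... but there is no figure.")
print(rumination_score(well_posed), rumination_score(mip))
```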
Non-reasoning models behave differently: they produce shorter responses, are significantly more likely to identify the question as ill-posed, and achieve better abstain rates. They do not ruminate.
The mechanism: reasoning-specific training optimizes for generating thinking patterns (for producing reasoning steps) but does not develop the meta-capability to recognize when thinking cannot help. The training signal rewards chains that lead to answers. Questions without valid answers provide no such signal, so no training pressure ever develops the critical-thinking capability to disengage.
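A minimal sketch of why no corrective pressure arises, assuming a simple verifiable-outcome reward (the `Example` type and reward function below are illustrative, not any particular pipeline's code): the reward can only compare a produced answer against a gold answer, so on a question with no gold answer, five thousand tokens of rumination and a clean "this is unanswerable" earn exactly the same reward.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    question: str
    gold_answer: Optional[str]  # None for ill-posed / missing-premise questions

def outcome_reward(example: Example, model_answer: str) -> float:
    """Typical verifiable-outcome reward: 1 for a correct final answer, else 0."""
    if example.gold_answer is None:
        # No gold answer exists, so every behavior receives the same reward: zero.
        # Nothing here ever pushes the policy toward "recognize and disengage."
        return 0.0
    return 1.0 if model_answer.strip() == example.gold_answer.strip() else 0.0
```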
Three observations deepen this:
- Reasoning models show large increases in step count for MiP questions — most steps are redundant self-doubt
- The overthinking is contagious through distillation — models distilled from reasoning model responses inherit the overthinking pattern
- The problem generalizes beyond the "missing premises" framing — any question where the correct response is not to reason further will expose this deficit
This contradicts the naïve test-time scaling law assumption. Scaling thinking tokens is supposed to improve outcomes. For ill-posed questions, it does the opposite. The model is burning compute on questions that require no answer, only recognition.
The practical implication for deployed reasoning agents: well-formed questions from trusted sources are fine. Ill-formed, ambiguous, or manipulative questions are not, because the reasoning model will not disengage; it will overthink.
Prompting-level mitigation: ISP2 (Iterative Summarization Pre-Prompting) demonstrates that pre-reasoning information gathering can partially address the implicit/missing information problem. The technique extracts entities and their descriptions from the question, rates the reliability of these information pairs, then iteratively merges the lowest-reliability pairs into new descriptions — building a key information pair that is fed alongside the original question into reasoning. The principle: "understanding before reasoning" — CoT emphasizes reasoning stages but neglects the critical prior step of gathering and extracting essential information. ISP2 addresses the missing-premise gap from the prompting side, while training-based approaches like Can models learn to ask clarifying questions instead of guessing? address it from the capability side.
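A minimal sketch of that ISP2 loop, assuming a generic `llm(prompt) -> str` completion function (the function name, prompt wording, merge rule, and stopping condition are placeholders, not the paper's implementation):

```python
from typing import Callable, List, Tuple

def isp2_preprocess(question: str, llm: Callable[[str], str], max_rounds: int = 5) -> str:
    """Sketch of the ISP2 loop: gather and consolidate key information
    from the question before any reasoning is attempted."""
    # 1. Extract (entity, description) pairs from the question.
    raw = llm(
        "List the entities in this question, one per line as 'entity: description'.\n"
        f"Question: {question}"
    )
    pairs: List[Tuple[str, str]] = [
        (line.split(":", 1)[0].strip(), line.split(":", 1)[1].strip())
        for line in raw.splitlines() if ":" in line
    ]

    for _ in range(max_rounds):
        if len(pairs) <= 1:
            break
        # 2. Rate the reliability of each (entity, description) pair.
        scores = []
        for entity, desc in pairs:
            reply = llm(
                "Rate 0-10 how reliable and complete this description is, "
                "given the question. Answer with a number only.\n"
                f"Question: {question}\n{entity}: {desc}"
            )
            try:
                scores.append(float(reply.strip()))
            except ValueError:
                scores.append(0.0)
        # 3. Merge the two lowest-reliability pairs into one richer description.
        lowest = sorted(range(len(pairs)), key=lambda i: scores[i])[:2]
        (e1, d1), (e2, d2) = pairs[lowest[0]], pairs[lowest[1]]
        merged = llm(
            "Merge these two descriptions into one, filling in whatever can be "
            f"inferred from the question.\nQuestion: {question}\n{e1}: {d1}\n{e2}: {d2}"
        )
        pairs = [p for i, p in enumerate(pairs) if i not in lowest]
        pairs.append((f"{e1} + {e2}", merged.strip()))

    # 4. Feed the consolidated key information alongside the original question.
    key_info = "\n".join(f"{e}: {d}" for e, d in pairs)
    return f"Key information:\n{key_info}\n\nQuestion: {question}"
```

The design intent, as the source describes it, is that the consolidated key information surfaces what is implicit or missing before the reasoning stage starts, rather than letting the model discover the gap mid-chain and ruminate.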
QuestBench extends the picture from behavior to diagnostics: models can't even IDENTIFY what information is missing. They score only 40-50% on logic and planning clarification tasks, so the information-acquisition failure precedes the overthinking failure. See Can models identify what information they actually need? Together, the two findings describe a two-part deficit: (1) models cannot detect what information is needed, and (2) they cannot disengage when it is absent.
"When Prompts Go Wrong" (2025) extends this to code generation with a systematic taxonomy. Ambiguous descriptions (multiple plausible interpretations), contradictory descriptions (conflicting requirements), and incomplete descriptions (omitted constraints) each cause distinct failure modes. Contradictory descriptions result in the most logical errors — models attempt to satisfy incompatible requirements simultaneously. Incomplete descriptions cause models to make incorrect assumptions (e.g., assuming a base area is provided when "triangular" is omitted). Even larger, more resilient models are not immune. The finding generalizes the missing-premises problem: it is not specific to reasoning tasks but a fundamental vulnerability wherever task specifications are imperfect. Source: Arxiv/Prompts Prompting.
Source: Reasoning Critiques
Related concepts in this collection
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  missing premises may push models past any threshold by making the threshold undefined
- Does more thinking time actually improve LLM reasoning?
  The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
  MiP is a particularly sharp falsification: more thinking is not just unhelpful, it actively produces worse behavior
- Why do reasoning models fail under manipulative prompts?
  Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
  same vulnerability pattern: reasoning models trained to use thinking are more susceptible to scenarios where thinking doesn't help
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  connects directly: reasoning training reduces appropriate non-answering
- Can models identify what information they actually need?
  When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
  diagnostic complement: models can't identify what's missing (QuestBench 40-50%), then overthink when it IS missing
- How do users actually form intent when prompting AI systems?
  Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
  when users provide incomplete intent (the default condition), reasoning models overthink rather than recognizing the gap and helping users mature their intent
- Why do language models lose performance in longer conversations?
  Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
  multi-turn conversation is the natural habitat of missing premises: gradually revealed instructions create underspecification that reasoning models overthink rather than resolve; the Mediator-Assistant architecture separates the problem, preventing overthinking at the execution stage
- When should retrieval happen during model generation?
  Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
  active retrieval offers a constructive exit from the overthinking spiral: when uncertainty is detected, retrieve external information instead of generating more reasoning tokens; without this mechanism, the model can only ruminate
- Can models reason without generating visible thinking steps?
  Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.
  latent recurrence with bounded depth provides an architectural constraint against rumination: verbalized reasoning models cannot stop token generation when premises are missing, but bounded latent iteration would naturally cap unproductive cycles rather than spiraling into self-doubt
- Why do language models fail in gradually revealed conversations?
  Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
  multi-turn conversation is the natural habitat of missing premises: gradually revealed instructions create exactly the underspecification that triggers overthinking rather than clarification; the 39% degradation is the conversational cost of the critical thinking deficit
- Why do users drift away from their original information need?
  When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
  users in ASK states naturally produce the incomplete queries that trigger overthinking: they know they need something but cannot specify what, producing vague questions with implicit missing premises that reasoning models ruminate on rather than recognizing as underspecified
Original note title: missing premises exacerbate overthinking — reasoning models lack critical thinking to reject ill-posed questions