LLM Reasoning and Architecture Design · LLM Interaction · Language Understanding and Pragmatics

Why do reasoning models overthink ill-posed questions?

Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.

Note · 2026-02-22 · sourced from Reasoning Critiques

The standard case for reasoning models: they think more, therefore they reason better. The missing-premise case inverts this completely.

When given questions with missing premises (MiP) — questions that are unanswerable because they lack necessary information — reasoning models produce responses that are drastically longer than for normal questions. The additional length is not useful thinking. It is redundant self-doubt: the model cycles through "alternatively," "wait," "check," and "but" without making progress, unable to resolve the contradiction introduced by the missing premise.

Non-reasoning models behave differently. They produce shorter responses and are significantly more likely to identify the question as ill-posed. They achieve better abstain rates. They do not ruminate.
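
This behavioral signature is cheap to detect after the fact, provided you can see the raw reasoning trace. A minimal sketch of such a detector follows; the marker list mirrors the hedge words quoted above, but the thresholds and baseline are illustrative assumptions, not values reported in the source.

```python
import re

# Self-doubt markers associated with unproductive rumination on MiP questions
# ("alternatively", "wait", "check", "but"). Marker list and thresholds are
# illustrative assumptions, not values reported in the source.
SELF_DOUBT = re.compile(r"\b(alternatively|wait|check|but)\b", re.IGNORECASE)

def looks_like_rumination(trace: str, baseline_tokens: int = 400) -> bool:
    """Crude post-hoc heuristic: flag reasoning traces that are both unusually
    long relative to a baseline and dense in self-doubt markers."""
    n_tokens = max(len(trace.split()), 1)
    hits_per_100 = 100 * len(SELF_DOUBT.findall(trace)) / n_tokens
    length_ratio = n_tokens / baseline_tokens
    # Thresholds are placeholders; calibrate on traces from known well-posed questions.
    return length_ratio > 3.0 and hits_per_100 > 1.0
```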

The mechanism: reasoning-specific training optimizes for generating thinking patterns — for using reasoning steps — but does not develop the meta-capability to recognize when thinking cannot help. The training signal rewards chains that lead to answers. Questions without valid answers do not provide this signal, so no training pressure develops the critical thinking capability to disengage.

Three observations deepen this:

  1. Reasoning models show large increases in step count for MiP questions — most steps are redundant self-doubt
  2. The overthinking is contagious through distillation — models distilled from reasoning model responses inherit the overthinking pattern
  3. The problem generalizes beyond the "missing premises" framing — any question where the correct response is not to reason further will expose this deficit

This contradicts the naïve test-time scaling law assumption. Scaling thinking tokens is supposed to improve outcomes. For ill-posed questions, it does the opposite. The model is burning compute on questions that require no answer, only recognition.

The practical implication for deployed reasoning agents: well-formed questions from trusted sources are fine. Ill-formed, ambiguous, or manipulative questions are not — the reasoning model will not disengage, it will overthink.
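
One way to act on this in deployment is a gate: before spending reasoning-model compute, let a cheap non-reasoning model (which, per the findings above, abstains more readily) screen the question for missing premises. A minimal sketch of that pattern, where the screening prompt and the two model-calling callables are assumptions rather than a published recipe:

```python
from typing import Callable

# Screening prompt is an illustrative assumption, not a published recipe.
SCREEN_PROMPT = (
    "Does the following question contain all the information needed to answer it? "
    "Reply with exactly ANSWERABLE or MISSING_PREMISE, then one sentence naming "
    "any missing information.\n\nQuestion: {question}"
)

def answer_with_gate(question: str,
                     ask_small: Callable[[str], str],
                     ask_reasoner: Callable[[str], str]) -> str:
    """Gate a reasoning model behind a cheap abstention check.

    ask_small / ask_reasoner are whatever prompt-in, text-out clients you
    already have for a small instruct model and a reasoning model."""
    verdict = ask_small(SCREEN_PROMPT.format(question=question))
    if verdict.strip().upper().startswith("MISSING_PREMISE"):
        # Don't hand the question to the reasoning model at all; surface the gap.
        return "Missing information detected: " + verdict.strip()
    # Question looks well-formed: spend the reasoning budget.
    return ask_reasoner(question)
```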

Prompting-level mitigation: ISP2 (Iterative Summarization Pre-Prompting) demonstrates that pre-reasoning information gathering can partially address the implicit/missing information problem. The technique extracts entities and their descriptions from the question, rates the reliability of these information pairs, then iteratively merges the lowest-reliability pairs into new descriptions — building a key information pair that is fed alongside the original question into reasoning. The principle: "understanding before reasoning" — CoT emphasizes reasoning stages but neglects the critical prior step of gathering and extracting essential information. ISP2 addresses the missing-premise gap from the prompting side, while training-based approaches like Can models learn to ask clarifying questions instead of guessing? address it from the capability side.
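
Read as pseudocode, the ISP2 loop is straightforward. A sketch following the description above, where `llm` is any string-in/string-out client and the prompts, JSON schema, and stopping rule are one plausible instantiation, not the paper's exact implementation:

```python
import json
from typing import Callable

def isp2_context(question: str, llm: Callable[[str], str], target_pairs: int = 2) -> str:
    """ISP2-style pre-reasoning loop: extract (entity, description) pairs with
    reliability ratings, then iteratively merge the least-reliable pairs into
    richer descriptions until a compact key-information summary remains."""
    # Step 1: extraction. The prompt and JSON schema here are illustrative.
    raw = llm(
        "List the key entities in this question. For each, give a description and "
        "rate 1-10 how completely the question specifies it. Return a JSON list of "
        'objects with keys "entity", "description", "reliability".\n\n' + question
    )
    pairs = json.loads(raw)

    # Step 2: iteratively merge the two least-reliable pairs into one new pair.
    while len(pairs) > target_pairs:
        pairs.sort(key=lambda p: p["reliability"])
        merged = json.loads(llm(
            "Combine these two pieces of information into a single entity/description "
            "pair, re-rating its reliability 1-10. Return one JSON object with the "
            "same keys.\n\n" + json.dumps(pairs[:2])
        ))
        pairs = [merged] + pairs[2:]

    # Step 3: the surviving pairs are the key information fed alongside the question.
    return "\n".join(f"{p['entity']}: {p['description']}" for p in pairs)
```

The returned summary is prepended to the original question before the reasoning call, making "understanding before reasoning" a concrete preprocessing step rather than a slogan.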

QuestBench extends the picture from behavior to diagnostics: models cannot even identify what information is missing. At 40-50% accuracy on logic and planning clarification tasks, the information-acquisition failure precedes the overthinking failure. See Can models identify what information they actually need? — the two findings describe a two-part deficit: (1) the model cannot detect what information it needs, and (2) it cannot disengage when that information is absent.

"When Prompts Go Wrong" (2025) extends this to code generation with a systematic taxonomy. Ambiguous descriptions (multiple plausible interpretations), contradictory descriptions (conflicting requirements), and incomplete descriptions (omitted constraints) each cause distinct failure modes. Contradictory descriptions result in the most logical errors — models attempt to satisfy incompatible requirements simultaneously. Incomplete descriptions cause models to make incorrect assumptions (e.g., assuming a base area is provided when "triangular" is omitted). Even larger, more resilient models are not immune. The finding generalizes the missing-premises problem: it is not specific to reasoning tasks but a fundamental vulnerability wherever task specifications are imperfect. Source: Arxiv/Prompts Prompting.


Source: Reasoning Critiques

Original note: missing premises exacerbate overthinking — reasoning models lack critical thinking to reject ill-posed questions