Why do language models overthink simple questions when given extra time?
This explores why reasoning models burn extra compute on questions that don't need it — and what the corpus says is actually breaking when 'more thinking' makes answers worse, not better.
The corpus points to a surprising culprit: overthinking isn't a thinking problem, it's a *stopping* problem. Models are trained to generate reasoning steps but almost never trained on when to disengage. When a question is ill-posed or missing a premise, reasoning models keep elaborating, producing long, redundant chains, while plain non-reasoning models simply flag it as unanswerable (Why do reasoning models overthink ill-posed questions?). Extra time doesn't buy more correctness here; it buys more rope.
The most counterintuitive finding is that giving a model more inference-time compute can actively *degrade* its judgment. On deliberately flawed math problems, scaling up thinking made models *worse* at noticing the flaw when they had not been trained for it, yet the same scaling helped once the model was explicitly trained, via reinforcement learning, to think critically (Can models learn to ask clarifying questions instead of guessing?). So 'extra time' is not neutral. Without a learned sense of when to stop, more steps just amplify whatever the model was already doing, including chasing a malformed question deeper.
Part of the answer is that knowing when to think is a separate skill that has to be trained in on purpose. Thinkless trains a single model to route between extended reasoning and a direct answer, decoupling the 'should I think?' decision from the 'what's the answer?' refinement so the model can self-calibrate by difficulty rather than defaulting to maximum effort on everything (Can models learn when to think versus respond quickly?). Overthinking, in this light, is what happens when that routing layer is missing: the model has only one gear.
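The routing idea can be made concrete with a toy sketch. It is not the Thinkless implementation: `route`, `answer`, the `threshold` parameter, and the idea that the mode decision is exposed as a single probability are all assumptions made for illustration. The only point is that the mode choice is made first and separately from generating the answer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedAnswer:
    mode: str  # "think" or "direct"
    text: str


def route(p_think: float, threshold: float = 0.5) -> str:
    """Choose a mode from a hypothetical learned probability that thinking helps."""
    return "think" if p_think >= threshold else "direct"


def answer(
    question: str,
    p_think_fn: Callable[[str], float],  # stand-in for a learned mode selector
    think_fn: Callable[[str], str],      # stand-in for extended reasoning
    direct_fn: Callable[[str], str],     # stand-in for a short direct reply
    threshold: float = 0.5,
) -> RoutedAnswer:
    # Step 1: decide the mode. Step 2: only then produce the answer in that mode.
    mode = route(p_think_fn(question), threshold)
    text = think_fn(question) if mode == "think" else direct_fn(question)
    return RoutedAnswer(mode, text)


# Usage with stub functions: an easy question is routed to the direct path.
easy = answer(
    "What is 2 + 2?",
    p_think_fn=lambda q: 0.05,
    think_fn=lambda q: "Let me reason step by step ... 4",
    direct_fn=lambda q: "4",
)
print(easy.mode, easy.text)
```

A model with "only one gear" corresponds to `route` always returning `"think"`, whatever the question looks like.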
There's also a deeper question of whether the long chain is even doing real work. Logit-lens analysis shows that models trained with hidden chain-of-thought tokens can compute the correct answer in their first few layers and then suppress it in the final layers in favor of format-compliant filler, so the visible reasoning isn't always where the answer comes from (Do transformers hide reasoning before producing filler tokens?). And reasoning models don't break at a complexity threshold so much as at unfamiliar instances; they pattern-match to training examples rather than running a general algorithm, so a long chain succeeds or fails based on novelty, not length (Do language models fail at reasoning due to complexity or novelty?). That reframes 'overthinking' as effort spent regardless of whether it's the kind of problem extra effort can solve.
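The logit-lens reading mentioned above can be illustrated with a toy example: decode each layer's hidden state with the output unembedding matrix and see which token it favors. The vocabulary, matrix, and hidden states below are invented for illustration and come from no real model; the sketch also omits the final layer norm that a real logit lens would normally apply before unembedding.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def logit_lens(hidden_by_layer, unembed, vocab):
    """Return the top-scoring token at each layer when decoded with `unembed`."""
    decoded = []
    for hidden in hidden_by_layer:
        logits = matvec(unembed, hidden)
        best = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded


# Toy setup (all values are made up): 3 tokens, 2-dimensional hidden states.
vocab = ["7", "...", "the"]
unembed = [
    [1.0, 0.0],  # row for "7"
    [0.0, 1.0],  # row for "..." (filler)
    [0.5, 0.5],  # row for "the"
]
hidden_by_layer = [
    [2.0, 0.1],  # early layer: points toward the answer token
    [1.5, 0.4],  # middle layer: still points toward the answer token
    [0.2, 2.0],  # final layer: points toward the filler token
]

per_layer = logit_lens(hidden_by_layer, unembed, vocab)
print(per_layer)  # ['7', '7', '...']
```

In this toy, the answer token is the top prediction in the early layers and is displaced by filler only at the end, which is the shape of the pattern the note describes.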
The through-line the corpus draws: models are optimized to *produce* reasoning, not to *withhold* it. Whether it's failing to reject a missing premise, failing to ignore a distractor, or failing to route to a quick answer, the same training gap appears. Systems learn what to do far better than what *not* to do, and on simple questions, restraint is the missing instruction.
Sources (5 notes)
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.