INQUIRING LINE

Can models overthink and underthink at the same time?

This explores whether the two opposite failure modes of reasoning models — burning tokens on redundant overthinking and bailing out of good ideas too early (underthinking) — can actually show up together, and what the corpus says about diagnosing and fixing both at once.


Overthinking and underthinking look like two separate problems, but the corpus suggests they are two faces of the same broken thermostat, and that they routinely coexist, often inside the same response. The cleanest evidence comes from Can confidence patterns reveal overthinking versus underthinking?, which treats them as a single regulation problem: it watches a model's confidence moment-to-moment and steers in opposite directions depending on what it sees, trimming redundant loops when confidence swings widely (the model is spinning its wheels) and pushing exploration when the model is consistently overconfident and bailing too early. The fact that one mechanism has to push *both* ways implies a model can be doing both within a single trace.
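To make the two-way regulation concrete, here is a minimal sketch of a confidence-based diagnosis. It is not ReBalance's actual criterion: the function names, the window size, the thresholds, and the rule itself (high variance marks overthinking, sustained high confidence marks premature commitment) are all assumptions chosen for illustration.

```python
from statistics import mean, pvariance


def diagnose(confidences, window=8, var_hi=0.02, conf_hi=0.9):
    """Label the most recent window of per-step confidence values.

    Hypothetical rule (an assumption, not the published method):
      - high variance            -> "overthinking" (the model keeps wavering)
      - flat and high confidence -> "underthinking" (it commits too soon)
      - anything else            -> "balanced"
    """
    recent = confidences[-window:]
    if len(recent) < 2:
        return "balanced"
    if pvariance(recent) > var_hi:
        return "overthinking"
    if mean(recent) > conf_hi:
        return "underthinking"
    return "balanced"


def steering_sign(label):
    """-1 damps further deliberation, +1 encourages exploration, 0 leaves it alone."""
    return {"overthinking": -1, "underthinking": 1, "balanced": 0}[label]


wavering = [0.2, 0.9, 0.3, 0.95, 0.25, 0.9, 0.3, 0.9]
flat_high = [0.95] * 8
print(diagnose(wavering), steering_sign(diagnose(wavering)))
print(diagnose(flat_high), steering_sign(diagnose(flat_high)))
```

The point of the sketch is only that a single signal can drive corrections in both directions; a real system would apply the correction to the model's hidden states rather than return a sign.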

The two failure modes have surprisingly different fingerprints. Overthinking, in When does thinking too much actually hurt reasoning?, shows up as non-monotonic scaling: accuracy peaks at some token count and then declines sharply (87.3% down to 70.3% as thinking tokens scale from 1,100 to 16,000), because extra thinking inflates output variance and introduces self-revision errors. Underthinking, in Do reasoning models switch between ideas too frequently?, is the opposite waste: o1-style models abandon promising reasoning paths mid-stream and hop to new ones, spending tokens without ever finishing a thought. These can happen at once: a model can switch ideas too often (underthinking) *and* over-elaborate each abandoned fragment (overthinking), so it is simultaneously too shallow in commitment and too verbose in execution.
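The note on idea-switching reports that a decoding-only penalty on thought-transition tokens (the TIP strategy) discourages premature switching without fine-tuning. A minimal sketch of that kind of penalty follows. The parameter names `alpha` and `beta`, their default values, and the way a "thought" is tracked are assumptions for illustration, not the published configuration.

```python
def penalize_switch_tokens(logits, switch_token_ids, steps_in_thought,
                           alpha=3.0, beta=300):
    """Return a copy of `logits` with thought-switch tokens made less likely.

    logits:            list of floats, one per vocabulary id
    switch_token_ids:  ids of tokens that open a new line of thought
                       (for example the id of "Alternatively"); hypothetical
    steps_in_thought:  tokens generated since the current thought began
    alpha:             amount subtracted from each switch-token logit (assumed)
    beta:              number of steps after a thought starts during which
                       the penalty is active (assumed)
    """
    adjusted = list(logits)
    if steps_in_thought < beta:
        for token_id in switch_token_ids:
            adjusted[token_id] -= alpha
    return adjusted


def greedy_pick(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [1.0, 2.5, 2.0]   # toy vocabulary; id 1 plays the switch token
early = penalize_switch_tokens(logits, {1}, steps_in_thought=10)
late = penalize_switch_tokens(logits, {1}, steps_in_thought=500)
print(greedy_pick(early), greedy_pick(late))
```

Early in a thought the switch token loses to continuing the current path; once the window has passed, the logits are left untouched and the model is free to change direction.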

Why do models land in this double bind? The note Why do reasoning models overthink ill-posed questions? gives a structural reason: training rewards producing reasoning steps but never teaches a model *when to stop*. Handed a question with missing premises, reasoning models keep generating while non-reasoning models simply identify it as unanswerable. The disengage switch was never installed. And Does extended thinking help or hurt model reasoning? shows the same extended-thinking machinery can be either helpful or harmful depending on training: vanilla models use it for counterproductive self-doubt (which looks like underthinking-by-second-guessing), while RL training redirects it into productive gap analysis. The mechanism is neutral; quality is about regulation, not quantity.

The deeper lesson the corpus offers is that fixing this means *measuring effort honestly*, not just counting tokens. Can we measure how deeply a model actually reasons? proposes a deep-thinking ratio (DTR): the proportion of tokens whose predictions undergo significant revision across the model's internal layers. It separates genuine reasoning from going-through-the-motions, which can distinguish a model that's truly stuck from one that just looks busy. That matters because Do reasoning traces show how models actually think? warns the visible reasoning trace is partly theater: traces with logically invalid steps perform nearly as well as valid ones, so you can't trust the trace's length or apparent care to tell you whether the model is over- or under-thinking. The token count lies; you have to look inside.
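As a rough illustration of measuring revision across layers, the sketch below counts a token as "deep" when its predicted distribution settles only in the later layers. The source does not say how "significant revision" is measured, so the total-variation distance, the tolerance `tol`, the cutoff `depth_frac`, and the function names are all assumptions. The per-layer distributions are assumed to come from some intermediate readout of the model.

```python
def total_variation(p, q):
    """Total-variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def settling_layer(layer_dists, tol=0.1):
    """Earliest layer from which every later layer stays within `tol` of the final one."""
    final = layer_dists[-1]
    settled = len(layer_dists) - 1
    for i in range(len(layer_dists) - 1, -1, -1):
        if total_variation(layer_dists[i], final) <= tol:
            settled = i
        else:
            break
    return settled


def deep_thinking_ratio(tokens, depth_frac=0.5, tol=0.1):
    """Fraction of tokens whose prediction settles at or after `depth_frac` of the depth.

    tokens: one entry per generated token, each a list of per-layer distributions.
    """
    if not tokens:
        return 0.0
    deep = 0
    for layer_dists in tokens:
        last = len(layer_dists) - 1
        if settling_layer(layer_dists, tol) >= depth_frac * last:
            deep += 1
    return deep / len(tokens)


# Toy example: 4 layers, 2-word vocabulary.
shallow = [[0.9, 0.1]] * 4                                   # decided at layer 0
deep = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]      # flips at the last layer
print(deep_thinking_ratio([shallow, deep]))
```

Two traces of identical length can score very differently under a measure like this, which is the sense in which token count alone is uninformative.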

What you didn't know you wanted to know: overthinking and underthinking aren't a spectrum with a healthy middle; they're better understood as a *failure of self-knowledge*. Do models know what they don't know? shows models carry internal mechanisms for detecting whether they know facts about an entity, and that signal causally steers hallucination and refusal behavior. A model that knew when it knew would neither keep grinding nor bail early. So the real frontier isn't 'think more' or 'think less'. It is giving models a calibrated sense of their own uncertainty: the confidence signal that ReBalance (the method in Can confidence patterns reveal overthinking versus underthinking?) exploits, and the stochastic uncertainty-holding that GRAM (the method in Can stochastic latent reasoning help models explore multiple solutions?) builds directly into the reasoning loop.


Sources (9 notes)

Can confidence patterns reveal overthinking versus underthinking?

ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.

When does thinking too much actually hurt reasoning?

Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Can we measure how deeply a model actually reasons?

Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.

Do reasoning traces show how models actually think?

LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Next inquiring lines