INQUIRING LINE

Why do models overthink easy problems and underthink difficult ones?

This explores why reasoning models spend too many tokens on simple questions yet quit too early on hard ones — and the corpus points at a single root cause: models can often sense difficulty but don't act on that signal.


This explores why reasoning models spend too many tokens on simple questions yet quit too early on hard ones. The most striking finding in the corpus is that the problem usually isn't perception — it's commitment. Linear probes can decode a question's difficulty straight from a model's hidden states *before* it writes a single reasoning step, yet the model overthinks the easy ones anyway Can models recognize question difficulty before they reason?. So the model 'knows' but overrides what it knows. Part of why is in how training shapes reasoning: optimizing for producing reasoning steps never teaches a model when to *stop* or disengage, so it keeps generating even when a question is ill-posed or trivially answerable Why do reasoning models overthink ill-posed questions?.

A second thread complicates the easy/hard framing itself: longer reasoning traces don't reliably track problem difficulty at all. In controlled maze experiments, trace length correlates with difficulty only when the problem looks like the training data — and decouples completely once you go out-of-distribution. Trace length mostly reflects recall of familiar schemas, not adaptive computation Does longer reasoning actually mean harder problems?. That reframes 'underthinking hard problems' as something closer to 'the model never learned the right pattern.' Reasoning breakdowns track instance-level *unfamiliarity*, not abstract complexity — a model will succeed on a long chain it has seen and fail on a short one it hasn't Do language models fail at reasoning due to complexity or novelty?.

The training dynamics underneath this are uneven across difficulty. Easy problems reinforce answer shortcuts while actively suppressing deliberation; hard problems only activate reasoning features on the rare occasions the model succeeds, so those features barely get reinforced; medium difficulty is where both strengthen together What reasoning features does each difficulty level reinforce?. That asymmetry helps explain the lopsided behavior: the model is pulled toward shortcuts on easy items and starved of reinforcement on hard ones. And more thinking isn't a free fix — accuracy is non-monotonic, peaking at a task-specific token count and then falling sharply as extended thinking inflates variance and introduces self-revision errors When does thinking too much actually hurt reasoning?.

What's quietly encouraging is that the failure is also the fix. Because difficulty is legible inside the model, you can steer on it without retraining: ReBalance reads confidence variance and overconfidence as live diagnostics, then applies steering vectors that cut redundant overthinking and push exploration during underthinking, across models from 0.5B to 32B Can confidence patterns reveal overthinking versus underthinking?. There's even evidence the model already adapts internally — hidden states sparsify systematically as tasks get harder and less familiar, acting as a selective filter that stabilizes performance rather than a breakdown Do language models sparsify their activations under difficult tasks?. The thing you didn't know you wanted to know: the gap between what a model perceives about difficulty and what it does about it is wide enough that you can close it from the outside, at inference time, without touching the weights.


Sources 8 notes

Can models recognize question difficulty before they reason?

Linear probes successfully decode difficulty from LRM representations before reasoning begins, yet models still overthink simple questions. This reveals an action-commitment failure rather than a perception failure.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

What reasoning features does each difficulty level reinforce?

Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems activate reasoning features only on rare success; medium difficulty strengthens both simultaneously. Identical accuracy gains can reflect opposite internal changes.

When does thinking too much actually hurt reasoning?

Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.

Can confidence patterns reveal overthinking versus underthinking?

ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Next inquiring lines