Can conditioning generation on difficulty probes reduce overthinking on simple tasks?
This explores whether you can first measure how hard a task is — a 'difficulty probe' — and use that signal to stop the model from burning excess reasoning on easy questions; the corpus has a lot on overthinking, but the probe-as-control-signal idea splits into two camps: signals you read at inference vs. signals baked in by training.
The short version the corpus supports: yes, but the most reliable probe isn't an estimate of difficulty itself; it's the model's own confidence as it reasons. ReBalance treats confidence variance and overconfidence as live diagnostic signals. When the model is overconfident, it is likely padding an easy problem, so a training-free steering vector trims the redundancy; when confidence wobbles, the model is underthinking and gets pushed to explore more (see: Can confidence patterns reveal overthinking versus underthinking?). That's a difficulty probe in everything but name, and notably it needs no retraining.
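As a rough illustration of reading confidence as a diagnostic signal, the sketch below labels a window of per-step confidence values. It is not ReBalance's actual procedure: the function name `diagnose`, the thresholds `high_conf` and `max_var`, and the use of a simple mean and variance are all assumptions made for the example, and the steering step itself is left out.

```python
from statistics import mean, pvariance


def diagnose(confidences, high_conf=0.9, max_var=0.02):
    """Label a window of per-step confidences (each in [0, 1]).

    The thresholds are illustrative assumptions, not values taken
    from ReBalance or any other published method.
    """
    if not confidences:
        raise ValueError("need at least one confidence value")
    mu = mean(confidences)
    var = pvariance(confidences)
    if var > max_var:
        # Confidence wobbles: treat as underthinking, encourage exploration.
        return "underthinking"
    if mu >= high_conf:
        # Stable and very confident: likely padding an easy problem.
        return "overthinking"
    return "balanced"
```

A controller built on this would trim reasoning when the label is "overthinking" and extend it when the label is "underthinking"; how that intervention is applied (steering vector, early stop, budget change) is a separate design choice.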
Why bother? Because overthinking isn't a minor inefficiency; it actively destroys accuracy. Test-time scaling is non-monotonic: accuracy peaks at a task-specific token count, then falls off sharply (one study watched it drop from 87.3% to 70.3% as thinking tokens climbed from ~1,100 to ~16,000), with the extra tokens introducing self-revision errors rather than insight (see: When does thinking too much actually hurt reasoning?; Does more thinking time always improve reasoning accuracy?). The same studies note the dual failure mode, in which models overthink easy problems *and* underthink hard ones, which is exactly why a difficulty-aware controller is attractive: you want to spend the budget where it pays.
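To make the non-monotonic point concrete, here is a minimal sketch that picks the best token budget from a sweep and works out the size of the reported drop. The helper `best_budget` is hypothetical, and the `reported` list holds only the two points quoted above (with approximate token counts); a real sweep would include many more budgets, and the true peak need not sit at either of these two.

```python
def best_budget(sweep):
    """Return the (tokens, accuracy) pair with the highest accuracy.

    `sweep` is a list of (thinking_tokens, accuracy_percent) pairs.
    Ties go to the smaller token count, since extra tokens cost more.
    """
    if not sweep:
        raise ValueError("sweep must contain at least one measurement")
    return max(sweep, key=lambda point: (point[1], -point[0]))


# The only two points quoted in the text (token counts are approximate).
reported = [(1_100, 87.3), (16_000, 70.3)]

tokens, accuracy = best_budget(reported)
drop_points = round(reported[0][1] - reported[1][1], 1)  # percentage points lost
token_ratio = reported[1][0] / reported[0][0]            # times more tokens spent
print(tokens, accuracy, drop_points, round(token_ratio, 1))
```

On these two points the loss is 17.0 percentage points for roughly 14.5 times the thinking tokens, which is the sense in which extra budget is worse than wasted.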
Here's the catch the corpus surfaces, and it's the thing you didn't know you wanted to know: the model's own reasoning length is a *bad* proxy for difficulty. Controlled A* maze experiments show trace length tracks difficulty only for problems near the training distribution; out-of-distribution, the correlation breaks entirely, because trace length mostly reflects recall of memorized schemas, not adaptive computation (see: Does longer reasoning actually mean harder problems?). So a naive probe that reads 'long reasoning = hard problem' will mislead you precisely on the novel cases that matter most. A good difficulty probe has to measure something other than how much the model is already talking.
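One way to check whether trace length is a usable probe on a given set of problems is to measure its correlation with a known difficulty score, separately for in-distribution and out-of-distribution items. The sketch below is an assumption-laden illustration: `pearson` and `length_is_usable_probe` are hypothetical helpers, the `min_r` cutoff is arbitrary, and nothing here reproduces the maze study's methodology.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        # No variation in one sequence: treat the signal as uninformative.
        return 0.0
    return cov / (sx * sy)


def length_is_usable_probe(difficulty, trace_len, min_r=0.5):
    """True if trace length rises with difficulty strongly enough.

    `min_r` is an arbitrary illustrative cutoff, not a published value.
    """
    return pearson(difficulty, trace_len) >= min_r
```

Running the check once on in-distribution problems and once on out-of-distribution problems would show whether the length signal survives the shift; if it fails out-of-distribution, the probe should read something else, such as confidence dynamics.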
There's also a deeper version of the problem that conditioning on a probe can't fix. Reasoning models overthink ill-posed questions (ones with missing premises), generating long redundant answers when a non-reasoning model would just flag them as unanswerable. Training optimized for producing reasoning steps but never taught the model *when to disengage* (see: Why do reasoning models overthink ill-posed questions?). An inference-time probe steers within a model's existing repertoire; it doesn't install the judgment to quit. The training-side camp suggests where that judgment comes from: RL doesn't just change how much a model thinks; it redirects the same thinking mechanism from counterproductive self-doubt into productive gap analysis (see: Does extended thinking help or hurt model reasoning?). A related line argues that base models already hold latent reasoning that post-training merely selects and elicits rather than creates (see: Do base models already contain hidden reasoning ability?).
Put together, the corpus gives you two complementary answers. Inference-time probes (confidence signals) work, are cheap, and need no retraining — best for the overthink-on-easy-tasks case you asked about. But they ride on a model whose underlying disposition to stop is set by training, and they're only as good as the signal they read — so reach for confidence dynamics, not trace length, and don't expect a probe to teach a model the restraint it was never trained to have.
Sources (7 notes)
ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.
Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.