Can confidence patterns reveal overthinking versus underthinking?
This explores whether real-time confidence signals can diagnose when a reasoning model is trapped in redundant deliberation versus committing prematurely, and whether steering based on these signals can balance both failure modes.
Overthinking and underthinking are dual failures, and existing methods that suppress one often induce the other. Suppressing reflective keywords or truncating reasoning length reduces overthinking but causes underthinking — the model doesn't explore enough. Forcing longer chains reduces underthinking but generates redundancy. ReBalance resolves this by treating confidence as a continuous diagnostic signal rather than using binary interventions.
The diagnostic: Confidence values correlate with reasoning behavior in interpretable ways:
- High confidence variance — frequent indecisive switching between reasoning paths, causing redundant steps and delayed convergence. This IS overthinking: the model knows something is wrong but can't commit.
- Consistent overconfidence — premature commitment to an incorrect reasoning path. This IS underthinking: the model commits too early without adequate exploration.
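The two confidence signatures above can be sketched as a simple classifier. This is a minimal illustration with hypothetical thresholds (`var_threshold`, `conf_threshold` are made-up values; the paper's actual confidence definition and cutoffs are not given here):

```python
import statistics

def diagnose_mode(step_confidences, var_threshold=0.02, conf_threshold=0.9):
    """Classify reasoning mode from per-step confidence values.

    High variance -> indecisive switching between paths (overthinking);
    consistently high confidence -> premature commitment (underthinking).
    Thresholds are illustrative assumptions, not the paper's values.
    """
    if len(step_confidences) < 2:
        return "balanced"
    var = statistics.variance(step_confidences)
    mean = statistics.mean(step_confidences)
    if var > var_threshold:
        return "overthinking"   # confidence oscillates: can't commit
    if mean > conf_threshold:
        return "underthinking"  # uniformly overconfident: commits too early
    return "balanced"
```

The point of treating confidence as a continuous signal is that both failure modes fall out of the same measurement, rather than requiring two separate binary detectors.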
The mechanism: From a small-scale dataset, identify reasoning steps indicating each mode. Aggregate their hidden states into reasoning mode prototypes. Compute a steering vector encoding the transition from overthinking to underthinking. A dynamic control function modulates the vector's strength and direction based on real-time confidence: pruning redundancy during overthinking, promoting exploration during underthinking.
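The mechanism can be sketched roughly as below. Everything here is an assumption for illustration: prototypes as mean hidden states, the steering vector as a prototype difference, and the `control_strength` schedule are plausible readings of the description, not the paper's implementation:

```python
import numpy as np

def build_steering_vector(over_states, under_states):
    """Aggregate labeled hidden states into mode prototypes, then take the
    difference as the overthinking -> underthinking transition direction.
    (Prototype-as-mean is an assumption for this sketch.)"""
    proto_over = np.mean(over_states, axis=0)
    proto_under = np.mean(under_states, axis=0)
    return proto_under - proto_over

def control_strength(step_confidences, var_threshold=0.02,
                     conf_threshold=0.9, alpha=2.0):
    """Hypothetical dynamic control function: sign and magnitude of steering
    from real-time confidence. Positive pushes toward commitment (pruning
    redundancy during overthinking); negative pushes toward exploration
    (countering premature commitment during underthinking)."""
    var = np.var(step_confidences)
    mean = np.mean(step_confidences)
    if var > var_threshold:       # indecisive switching: overthinking
        return alpha * (var - var_threshold)
    if mean > conf_threshold:     # consistent overconfidence: underthinking
        return -alpha * (mean - conf_threshold)
    return 0.0

def steer(hidden_state, steering_vector, strength):
    """Apply the modulated steering vector to a layer's hidden state
    at decode time."""
    return hidden_state + strength * steering_vector
```

Because the vector is a difference of the model's own hidden-state prototypes, nothing is trained; only the scalar `strength` changes step to step.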
Why it's training-free: The steering vector captures the model's inherent reasoning dynamics — it's extracted from the model's own hidden states, not trained. Because it operates on intrinsic representations, it generalizes across unseen data and tasks (math, QA, coding). This makes it plug-and-play across models from 0.5B to 32B.
Building on "Can we steer reasoning toward brevity without retraining?": ReBalance extends the activation-steering approach from length compression to reasoning-quality management. ASC steers between verbose and concise modes; ReBalance steers between overthinking and underthinking, a qualitative distinction rather than a merely quantitative one.
Building on "Does more thinking time always improve reasoning accuracy?": ReBalance supplies the dynamic mechanism that the threshold finding calls for. Instead of a fixed cutoff, confidence-based steering continuously adjusts the reasoning trajectory.
Source: Reasoning by Reflection · Paper: Efficient Reasoning with Balanced Thinking
Related concepts in this collection
- Can we steer reasoning toward brevity without retraining? Explores whether a model's reasoning style occupies learnable geometric directions in activation space, and whether steering through that space can shift the model toward concise thinking without expensive retraining. Relation: ASC compresses length; ReBalance steers reasoning quality.
- Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or whether there is a point beyond which additional reasoning becomes counterproductive. Relation: ReBalance provides the dynamic mechanism, confidence-based steering versus a fixed threshold.
- Do reasoning models switch between ideas too frequently? Explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy. Relation: complementary; the underthinking penalty addresses premature switching, while ReBalance addresses premature commitment via overconfidence detection.
- When should an agent actually stop and deliberate? Asks how models can detect when deliberation over action choices is genuinely needed versus wasteful. This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors. Relation: both use uncertainty/confidence as the trigger for compute allocation.
ReBalance uses confidence as continuous indicator to dynamically steer between overthinking and underthinking — training-free balanced reasoning via hidden state steering vectors