Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can confidence patterns reveal overthinking versus underthinking?

This explores whether real-time confidence signals can diagnose when a reasoning model is trapped in redundant deliberation versus committing prematurely, and whether steering based on these signals can balance both failure modes.

Note · 2026-04-01 · sourced from Reasoning by Reflection
When does thinking too much actually hurt reasoning?

Overthinking and underthinking are dual failures, and existing methods that suppress one often induce the other. Suppressing reflective keywords or truncating reasoning length reduces overthinking but causes underthinking — the model doesn't explore enough. Forcing longer chains reduces underthinking but generates redundancy. ReBalance resolves this by treating confidence as a continuous diagnostic signal rather than using binary interventions.

The diagnostic: confidence values correlate with reasoning behavior in interpretable ways. Persistently high confidence while the model keeps reflecting signals redundant deliberation (overthinking); low confidence around the point of answer commitment signals premature convergence (underthinking).
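A minimal sketch of how such a confidence readout could work. The choice of per-token max softmax probability as the confidence measure, the mean pooling over a step, and the 0.4/0.9 cutoffs are all illustrative assumptions, not details from the paper:

```python
import numpy as np

def step_confidence(logits: np.ndarray) -> float:
    """Mean top-token probability over the tokens of one reasoning step.

    `logits` has shape (num_tokens, vocab_size). Max softmax probability
    as per-token confidence is an assumption; the paper may define the
    signal differently.
    """
    z = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return float(probs.max(axis=-1).mean())

def diagnose(conf: float, low: float = 0.4, high: float = 0.9) -> str:
    """Map a confidence reading to a reasoning mode (illustrative thresholds)."""
    if conf >= high:
        return "overthinking"    # already confident, yet still reflecting
    if conf <= low:
        return "underthinking"   # uncertain: risk of premature commitment
    return "balanced"
```

Sharply peaked logits yield near-1 confidence and an "overthinking" reading; flat logits yield uniform probabilities and an "underthinking" reading.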

The mechanism: From a small-scale dataset, identify reasoning steps indicating each mode. Aggregate their hidden states into reasoning mode prototypes. Compute a steering vector encoding the transition from overthinking to underthinking. A dynamic control function modulates the vector's strength and direction based on real-time confidence: pruning redundancy during overthinking, promoting exploration during underthinking.
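The mechanism can be sketched in a few lines, assuming mean pooling for the prototypes and a piecewise-linear control function. All names and constants here (`prototype`, `control_gain`, `steer`, `max_gain`, the 0.4/0.9 thresholds) are hypothetical, not the paper's implementation:

```python
import numpy as np

def prototype(states: np.ndarray) -> np.ndarray:
    """Mean-pool hidden states of steps labelled with one reasoning mode.
    Mean pooling is an assumed aggregation; the paper may use another."""
    return states.mean(axis=0)

def control_gain(conf: float, low: float = 0.4, high: float = 0.9,
                 max_gain: float = 2.0) -> float:
    """Piecewise-linear control (illustrative): positive gain when confidence
    is high (prune redundancy by steering along +v, toward commitment),
    negative when low (promote exploration along -v), zero in between."""
    if conf >= high:
        return max_gain * (conf - high) / (1.0 - high)
    if conf <= low:
        return -max_gain * (low - conf) / low
    return 0.0

def steer(hidden: np.ndarray, v: np.ndarray, conf: float) -> np.ndarray:
    """Add the confidence-modulated steering vector to a hidden state."""
    return hidden + control_gain(conf) * v

# Toy demo: prototypes from labelled hidden states (shape: steps x dim).
over_states = np.ones((4, 8))    # stand-in for overthinking-step activations
under_states = np.zeros((4, 8))  # stand-in for underthinking-step activations
v = prototype(under_states) - prototype(over_states)  # over -> under transition
```

In the balanced band the gain is zero and the hidden state passes through untouched, so the intervention only fires when confidence indicates one of the two failure modes.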

Why it's training-free: The steering vector captures the model's inherent reasoning dynamics — it's extracted from the model's own hidden states, not trained. Because it operates on intrinsic representations, it generalizes across unseen data and tasks (math, QA, coding). This makes it plug-and-play across models from 0.5B to 32B.

Since "Can we steer reasoning toward brevity without retraining?", ReBalance extends the activation-steering approach from length compression to reasoning quality management. ASC steers between verbose and concise modes; ReBalance steers between overthinking and underthinking — a qualitative distinction, not just quantitative.

Since "Does more thinking time always improve reasoning accuracy?", ReBalance provides the dynamic mechanism the threshold finding calls for: instead of a fixed cutoff, confidence-based steering continuously adjusts the reasoning trajectory.


Source: Reasoning by Reflection · Paper: Efficient Reasoning with Balanced Thinking


ReBalance uses confidence as a continuous indicator to dynamically steer between overthinking and underthinking — training-free balanced reasoning via hidden-state steering vectors