LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Do reasoning models switch between ideas too frequently?

Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Thoughts Are All Over the Place" identifies a failure mode complementary to but distinct from overthinking: underthinking. Where overthinking generates excessively long traces, underthinking generates traces that switch between reasoning directions too frequently, failing to follow any promising path to completion.

The empirical finding: frequent thought switching correlates with incorrect responses across multiple o1-like models on challenging mathematical test sets. The model starts down one reasoning path, encounters difficulty, switches to a different approach, encounters difficulty there too, switches again — never committing enough depth to any single path to reach a solution.
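
To make that correlation measurable, switch frequency has to be extracted from raw traces. A minimal sketch of one way to do this, assuming a hand-picked marker list; the paper's actual marker set and segmentation method are not reproduced here:

```python
import re

# Illustrative transition markers; the paper's actual marker set may differ.
SWITCH_MARKERS = re.compile(r"\b(Alternatively|Let me try|Wait|Another approach)\b")

def count_thought_switches(trace: str) -> int:
    """Count linguistic markers that signal a switch to a new reasoning
    direction, as a proxy for how often the model abandoned its path."""
    return len(SWITCH_MARKERS.findall(trace))

def mean_switches(traces: list[str]) -> float:
    """Average switch count over a set of traces; comparing this between
    correct and incorrect responses surfaces the correlation."""
    return sum(count_thought_switches(t) for t in traces) / len(traces)
```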

The paper quantifies this with a new metric: token efficiency in incorrect answers, measuring what fraction of the reasoning trace was "wasted" on abandoned approaches rather than spent productively advancing toward a solution.
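
A minimal sketch of how such a metric could be computed, assuming the trace has already been segmented into thoughts and each thought labeled productive or abandoned; the paper's exact definition may differ:

```python
def token_efficiency(thought_lengths: list[int], productive: list[bool]) -> float:
    """Fraction of trace tokens spent on thoughts that advance toward a
    solution; the remainder was "wasted" on abandoned approaches.

    thought_lengths: token count of each thought in the trace.
    productive: whether each thought contributed to the final answer.
    Both the segmentation and the labels are assumed given.
    """
    total = sum(thought_lengths)
    useful = sum(n for n, ok in zip(thought_lengths, productive) if ok)
    return useful / total if total else 0.0

# An underthinking trace: three short abandoned thoughts, one completed path.
print(token_efficiency([120, 90, 110, 400], [False, False, False, True]))  # ~0.56
```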

TIP (Thought-switching Penalty) is a pure decoding strategy — no model fine-tuning required. During generation, it penalizes the probability of tokens that signal thought transitions (linguistic markers like "Alternatively," "Let me try," "Wait"), encouraging the model to continue exploring the current path rather than jumping to a new one. The result: accuracy improves across challenging datasets.
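
Because TIP operates purely on decoding logits, it maps naturally onto a standard logits processor. A minimal sketch against the Hugging Face transformers interface, with an illustrative marker list and a hypothetical penalty strength `alpha`; the paper's exact penalty values and schedule are not reproduced here:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class ThoughtSwitchPenalty(LogitsProcessor):
    """Subtract a constant penalty from the logits of tokens that begin
    thought-transition phrases, making switches less likely at every
    decoding step. No fine-tuning involved."""

    def __init__(self, tokenizer, alpha: float = 3.0,
                 markers=("Alternatively", " Alternatively",
                          "Wait", " Wait", " Let me try")):
        self.alpha = alpha  # penalty strength (hypothetical default)
        # Penalizing the first token of each marker phrase is enough to
        # suppress the whole phrase during greedy or sampled decoding.
        self.first_ids = {
            ids[0] for m in markers
            if (ids := tokenizer.encode(m, add_special_tokens=False))
        }

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        for tid in self.first_ids:
            scores[:, tid] -= self.alpha
        return scores

# Usage: model.generate(**inputs,
#     logits_processor=LogitsProcessorList([ThoughtSwitchPenalty(tokenizer)]))
```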

This reframes the overthinking/underthinking relationship. They are not opposites on a single dimension (trace length). Overthinking is excessive computation within a committed path. Underthinking is insufficient computation per path due to premature switching. A model can simultaneously overthink (too many tokens total) and underthink (too few tokens per path) — producing a long trace that wanders between incomplete approaches.

The connection to "Why do reasoning LLMs fail at deeper problem solving?" is direct: premature thought switching is one mechanism that produces wandering behavior. The "unnecessary exploration" failure mode is exactly what happens when the model keeps abandoning productive branches for new ones before pursuing any single one in sufficient depth.


Source: Reasoning o1 o3 Search

Original note title:

underthinking is premature thought switching — penalizing reasoning transitions improves accuracy without fine-tuning