Why do reasoning models abandon promising solution paths?
Explores whether reasoning models fail because they think insufficiently or because they structurally misorganize their thinking. Challenges the assumption that longer reasoning traces automatically improve performance.
The dominant narrative about reasoning models: they think step by step, explore the solution space, and arrive at answers through deliberation. The reality: they wander.
The formalization. Systematic exploration requires three properties: validity (following legal transitions), effectiveness (reaching goals), and necessity (no wasted states). Current reasoning LLMs fail all three. Consider a model performing DFS on a binary tree of depth d, omitting each branch independently with probability p_w. In the simplest case of a unique goal leaf, success requires every edge on the goal path to survive, so accuracy decays as (1 - p_w)^d: at p_w = 0.2 that is roughly 0.33 at depth 5 but only 0.035 at depth 15. Problems that look tractable when shallow become effectively impossible when deep.
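A minimal simulation of that decay, under the stated assumptions (unique goal leaf, independent branch omissions); the value p_w = 0.2 and the depths are illustrative, not from the source:

```python
import random

P_W = 0.2  # illustrative branch-omission probability (not from the source)

def goal_path_survives(depth: int, p_w: float) -> bool:
    # With a unique goal leaf and independent omissions, DFS reaches the
    # goal iff every one of the `depth` edges on its path is retained.
    return all(random.random() > p_w for _ in range(depth))

def success_rate(depth: int, p_w: float, trials: int = 100_000) -> float:
    return sum(goal_path_survives(depth, p_w) for _ in range(trials)) / trials

for d in (5, 10, 15):
    print(f"depth {d:2d}: empirical {success_rate(d, P_W):.3f}, "
          f"analytic {(1 - P_W) ** d:.3f}")
```

The empirical rates track (1 - p_w)^d closely, which is the point: a small per-branch omission rate compounds into near-certain failure at depth.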
The complementary failure. Separately, o1-like models exhibit "underthinking" — not too little total reasoning, but too little depth per reasoning thread. The model starts down a promising path, encounters difficulty, switches to another approach, encounters difficulty there, switches again. The result is a long trace (many tokens) with shallow exploration (insufficient depth on any single path).
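One rough, hypothetical way to quantify "depth per thread": split a trace on phrases that tend to signal a thought switch and count tokens per segment. The marker list and the whitespace-token proxy are illustrative assumptions, not the measure used in the source:

```python
import re

# Hypothetical marker list; real switch cues depend on the model's style.
SWITCH_MARKERS = re.compile(
    r"\b(alternatively|let'?s try|another approach|wait|instead)\b",
    re.IGNORECASE,
)

def thread_depths(trace: str) -> list[int]:
    """Whitespace-token count of each segment between thought switches."""
    parts = SWITCH_MARKERS.split(trace)
    prose = parts[::2]  # re.split keeps captured markers at odd indices
    return [len(seg.split()) for seg in prose if seg.strip()]

trace = ("Try induction on n. The base case holds, and for the step... "
         "Alternatively, a counting argument might work... "
         "Let's try contradiction: suppose no solution exists...")
depths = thread_depths(trace)
print(f"{len(depths)} threads, mean depth {sum(depths) / len(depths):.1f} tokens")
```

An underthinking trace scores high on thread count and low on mean depth even when its total token count is large.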
Why both matter together. Wandering and underthinking are not the same failure mode, but they reinforce each other. A model that switches approaches prematurely (underthinking) generates more abandoned branches to wander between (wandering). More compute doesn't fix either — a wandering model given more tokens wanders more extensively, and an underthinking model given more tokens switches more frequently.
The practical fix is surprising. TIP (Thought-switching Penalty) is a pure decoding strategy that penalizes tokens signaling thought transitions. It improves accuracy without fine-tuning — just by encouraging the model to stay on its current path longer. The implication: the model often had a viable path and abandoned it prematurely. The answer was reachable from the original approach.
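A minimal sketch of a TIP-style penalty, assuming a HuggingFace-style `LogitsProcessor` interface. The switch-token ids (e.g., the ids of "Alternatively"), the penalty strength, and the grace window are illustrative hyperparameters, not the paper's exact configuration:

```python
import torch
from transformers import LogitsProcessor

class ThoughtSwitchPenalty(LogitsProcessor):
    """Subtract a fixed penalty from switch-token logits while decoding
    the first `duration` tokens of the current thought."""

    def __init__(self, switch_token_ids: list[int],
                 penalty: float = 3.0, duration: int = 600):
        self.switch_token_ids = switch_token_ids  # assumed switch-cue ids
        self.penalty = penalty    # how strongly to discourage a switch
        self.duration = duration  # grace window for the current thought

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        switch_set = set(self.switch_token_ids)
        for row, seq in enumerate(input_ids.tolist()):
            # Index of the most recent switch token (-1 if none yet):
            # the current thought began right after it.
            last = max((i for i, t in enumerate(seq) if t in switch_set),
                       default=-1)
            if len(seq) - last <= self.duration:
                scores[row, self.switch_token_ids] -= self.penalty
        return scores
```

Plugged into generation via `logits_processor=LogitsProcessorList([ThoughtSwitchPenalty(ids)])`, no weights change; the only intervention is on the sampling distribution, which is what makes the accuracy gain so telling.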
This reframes the entire "scale inference compute" research program. The bottleneck is not how much the model thinks — it is how it structures its thinking. A tourist visiting more landmarks is not the same as a scientist following a hypothesis to its conclusion.
Supporting material:
- Why do reasoning LLMs fail at deeper problem solving?
- Do reasoning models switch between ideas too frequently?
- Does self-revision actually improve reasoning in language models?
- Does more thinking time always improve reasoning accuracy?
Source: Reasoning, o1, o3, Search
Original note title
the wandering mind — why reasoning models explore like tourists not scientists