Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Why do reasoning LLMs fail at deeper problem solving?

Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Reasoning LLMs are Wandering Solution Explorers" provides the most rigorous formalization yet of why reasoning models fail as problem complexity increases. The claim: current RLLMs do not systematically explore solution spaces. They wander.

Systematic exploration requires three properties: (a) validity — the trace follows the reachability structure; (b) effectiveness — the trace contains at least one goal state; (c) necessity — every state in the trace contributes to goal discovery or dead-end elimination. Current models fail all three.
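A minimal sketch of how these three properties could be checked mechanically, assuming the solution space is given as an explicit state graph; `check_systematic`, its argument names, and the no-revisit approximation of necessity are illustrative choices here, not the paper's formal definitions:

```python
def check_systematic(trace: list, neighbors: dict, goals: set) -> dict:
    """Check a reasoning trace against the three properties of
    systematic exploration. neighbors[s] is the set of states
    reachable in one step from s (the reachability structure)."""
    # (a) validity: every consecutive step follows a real edge
    validity = all(b in neighbors.get(a, set())
                   for a, b in zip(trace, trace[1:]))
    # (b) effectiveness: the trace contains at least one goal state
    effectiveness = any(s in goals for s in trace)
    # (c) necessity, crudely approximated as "no revisits": a repeated
    # state contributes nothing to goal discovery or dead-end elimination
    necessity = len(set(trace)) == len(trace)
    return {"valid": validity, "effective": effectiveness,
            "necessary": necessity}

graph = {"s": {"a", "b"}, "a": {"g"}, "b": set()}
print(check_systematic(["s", "a", "g"], graph, goals={"g"}))
# -> {'valid': True, 'effective': True, 'necessary': True}
```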

The formalization makes the failure quantifiable. A wandering RLLM performing depth-first search on a binary tree of depth d omits one of the two child nodes at each decision point with probability p_w. The probability of reaching a fixed goal leaf therefore decays exponentially in d. This is not gradual degradation; it is catastrophic. Problems that appear within reach at depth 5 become virtually impossible at depth 15, not because the model lacks reasoning ability but because it lacks search discipline.
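A toy Monte Carlo sketch of that decay, under one concrete reading of the model: the omitted child is chosen uniformly, so the branch leading to a fixed goal leaf survives each level with probability 1 - p_w/2. The constants and the closed form below are our assumptions, not figures from the paper:

```python
import random

def wandering_success(depth: int, p_w: float, trials: int = 100_000) -> float:
    """Estimate P(a wandering DFS reaches a fixed goal leaf) on a full
    binary tree: at each of the depth decision points along the goal
    path, one uniformly chosen child is omitted with probability p_w,
    and the goal is lost if that child lies on the goal path."""
    hits = 0
    for _ in range(trials):
        if all(not (random.random() < p_w and random.random() < 0.5)
               for _ in range(depth)):
            hits += 1
    return hits / trials

# Closed form under these assumptions: (1 - p_w / 2) ** depth.
# With p_w = 0.4 the per-level survival rate is 0.8, so depth 5
# gives ~0.33 while depth 15 gives only ~0.035: the catastrophic drop.
for d in (5, 10, 15):
    print(d, round(wandering_success(d, p_w=0.4), 3), round(0.8 ** d, 3))
```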

Four failure modes are identified in the paper, each degrading success probability exponentially with problem depth.

The finding directly challenges the "more thinking tokens = better reasoning" narrative. A wandering model given more tokens doesn't explore more systematically; it wanders more extensively. This is the mechanism behind "Does more thinking time always improve reasoning accuracy?": additional compute doesn't fix a structural search deficiency.
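A throwaway sketch of that point: if undisciplined search is modeled as uniform sampling with replacement over the state space (a deliberately crude assumption of ours, far simpler than real model behavior), coverage of new states saturates no matter how much budget is added, while a systematic enumerator covers states linearly until exhaustion:

```python
import random

def distinct_states_visited(budget: int, n_states: int,
                            systematic: bool) -> int:
    """Distinct states covered within a step budget. The systematic
    searcher enumerates states in order and never revisits; the
    wanderer samples uniformly with replacement, a stand-in for
    search that keeps revisiting familiar states."""
    if systematic:
        return min(budget, n_states)
    return len({random.randrange(n_states) for _ in range(budget)})

n = 10_000
for budget in (2_500, 10_000, 40_000):
    print(budget,
          distinct_states_visited(budget, n, systematic=True),
          distinct_states_visited(budget, n, systematic=False))
# Wanderer coverage saturates at n * (1 - (1 - 1/n) ** budget):
# quadrupling the budget from 10k to 40k buys ever fewer new states.
```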

The exponential degradation result connects to "Does policy entropy collapse limit reasoning performance in RL?". Entropy collapse reduces exploration diversity during training; wandering reduces exploration discipline during inference. Both are manifestations of the same problem: the model converges on familiar patterns rather than systematically covering the solution space.

Apple's three-regime confirmation. "The Illusion of Thinking" (Apple) provides independent confirmation through controllable puzzle environments with precise complexity manipulation. Three performance regimes emerge: (1) low complexity, where standard models outperform reasoning models with greater token efficiency; (2) medium complexity, where reasoning models gain an advantage through extended thinking; (3) high complexity, where both model types collapse to zero accuracy. Near the collapse point, reasoning models reduce their reasoning effort despite having ample token budget, a counterintuitive behavioral scaling limit. Even providing explicit optimal algorithms does not prevent collapse, confirming that the bottleneck is execution, not conceptualization. The three-regime structure refines the wandering-explorer thesis: wandering is harmful at low complexity (overthinking easy problems), partially beneficial at medium complexity (exploring toward solutions), and irrelevant at high complexity (no amount of wandering reaches the goal).


Source: Reasoning o1 o3 Search; enriched from Flaws


Original note title

reasoning llms are wandering explorers not systematic searchers — four failure modes degrade success probability exponentially with problem depth