Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Why do reasoning LLMs fail at deeper problem solving?

Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Reasoning LLMs are Wandering Solution Explorers" provides the most rigorous formalization yet of why reasoning models fail as problem complexity increases. The claim: current RLLMs do not systematically explore solution spaces. They wander.

Systematic exploration requires three properties: (a) validity — the trace follows the reachability structure; (b) effectiveness — the trace contains at least one goal state; (c) necessity — every state in the trace contributes to goal discovery or dead-end elimination. Current models fail all three.
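A minimal sketch of how these three properties could be checked mechanically, assuming the solution space is given as an explicit state graph; `check_systematic`, its argument names, and the no-revisit approximation of necessity are illustrative choices here, not the paper's formal definitions:

```python
def check_systematic(trace: list, neighbors: dict, goals: set) -> dict:
    """Check a reasoning trace against the three properties of
    systematic exploration. neighbors[s] is the set of states
    reachable in one step from s (the reachability structure)."""
    # (a) validity: every consecutive step follows a real edge
    validity = all(b in neighbors.get(a, set())
                   for a, b in zip(trace, trace[1:]))
    # (b) effectiveness: the trace contains at least one goal state
    effectiveness = any(s in goals for s in trace)
    # (c) necessity, crudely approximated as "no revisits": a repeated
    # state contributes nothing to goal discovery or dead-end elimination
    necessity = len(set(trace)) == len(trace)
    return {"valid": validity, "effective": effectiveness,
            "necessary": necessity}

graph = {"s": {"a", "b"}, "a": {"g"}, "b": set()}
print(check_systematic(["s", "a", "g"], graph, goals={"g"}))
# -> {'valid': True, 'effective': True, 'necessary': True}
```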

The formalization makes the failure quantifiable. A wandering RLLM performing depth-first search on a binary tree of depth d omits one of the two child nodes at each decision point with probability p_w. The probability of reaching a fixed goal leaf therefore decays exponentially in d. This is not gradual degradation; it is catastrophic. Problems that appear within reach at depth 5 become virtually impossible at depth 15, not because the model lacks reasoning ability but because it lacks search discipline.
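A toy Monte Carlo sketch of that decay, under one concrete reading of the model: the omitted child is chosen uniformly, so the branch leading to a fixed goal leaf survives each level with probability 1 - p_w/2. The constants and the closed form below are our assumptions, not figures from the paper:

```python
import random

def wandering_success(depth: int, p_w: float, trials: int = 100_000) -> float:
    """Estimate P(a wandering DFS reaches a fixed goal leaf) on a full
    binary tree: at each of the depth decision points along the goal
    path, one uniformly chosen child is omitted with probability p_w,
    and the goal is lost if that child lies on the goal path."""
    hits = 0
    for _ in range(trials):
        if all(not (random.random() < p_w and random.random() < 0.5)
               for _ in range(depth)):
            hits += 1
    return hits / trials

# Closed form under these assumptions: (1 - p_w / 2) ** depth.
# With p_w = 0.4 the per-level survival rate is 0.8, so depth 5
# gives ~0.33 while depth 15 gives only ~0.035: the catastrophic drop.
for d in (5, 10, 15):
    print(d, round(wandering_success(d, p_w=0.4), 3), round(0.8 ** d, 3))
```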

Four failure modes are identified in the paper, each degrading success probability exponentially with problem depth.

The finding directly challenges the "more thinking tokens = better reasoning" narrative. A wandering model given more tokens doesn't explore more systematically; it wanders more extensively. This is the mechanism behind "Does more thinking time always improve reasoning accuracy?": additional compute doesn't fix a structural search deficiency.
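A throwaway sketch of that point: if undisciplined search is modeled as uniform sampling with replacement over the state space (a deliberately crude assumption of ours, far simpler than real model behavior), coverage of new states saturates no matter how much budget is added, while a systematic enumerator covers states linearly until exhaustion:

```python
import random

def distinct_states_visited(budget: int, n_states: int,
                            systematic: bool) -> int:
    """Distinct states covered within a step budget. The systematic
    searcher enumerates states in order and never revisits; the
    wanderer samples uniformly with replacement, a stand-in for
    search that keeps revisiting familiar states."""
    if systematic:
        return min(budget, n_states)
    return len({random.randrange(n_states) for _ in range(budget)})

n = 10_000
for budget in (2_500, 10_000, 40_000):
    print(budget,
          distinct_states_visited(budget, n, systematic=True),
          distinct_states_visited(budget, n, systematic=False))
# Wanderer coverage saturates at n * (1 - (1 - 1/n) ** budget):
# quadrupling the budget from 10k to 40k buys ever fewer new states.
```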

The exponential degradation result connects to "Does policy entropy collapse limit reasoning performance in RL?". Entropy collapse reduces exploration diversity during training; wandering reduces exploration discipline during inference. Both are manifestations of the same problem: the model converges on familiar patterns rather than systematically covering the solution space.

Apple's three-regime confirmation. "The Illusion of Thinking" (Apple) provides independent confirmation through controllable puzzle environments with precise complexity manipulation. Three performance regimes emerge: (1) low complexity, where standard models outperform reasoning models with greater token efficiency; (2) medium complexity, where reasoning models gain an advantage through extended thinking; (3) high complexity, where both model types collapse to zero accuracy. Near the collapse point, reasoning models reduce their reasoning effort despite having ample token budget, a counterintuitive behavioral scaling limit. Even providing explicit optimal algorithms does not prevent collapse, confirming that the bottleneck is execution, not conceptualization. The three-regime structure refines the wandering-explorer thesis: wandering is harmful at low complexity (overthinking easy problems), partially beneficial at medium complexity (exploring toward solutions), and irrelevant at high complexity (no amount of wandering reaches the goal).


Source: Reasoning o1 o3 Search; enriched from Flaws


Original note title

reasoning llms are wandering explorers not systematic searchers — four failure modes degrade success probability exponentially with problem depth