Reasoning LLMs are Wandering Solution Explorers

Paper · arXiv 2505.20296 · Published May 26, 2025
Reasoning o1 o3 Search

However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderer rather than systematic explorers. Through qualitative and quantitative analysis across multiple state-of-the- art LLMs, we uncover persistent issues: invalid reasoning steps, redundant explorations, hallucinated or unfaithful conclusions, and so on. Our findings suggest that current models’ performance can appear to be competent on simple tasks yet degrade sharply as complexity increases. Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.

This paper challenges this hope by pointing out that the “longer thinking” strategy employed by existing reasoning LLMs (RLLMs) does not necessarily make them “think better”. In fact, they are wandering in the solution space. Specifically, we argue that a “better” or systematic solution exploration should satisfy a few properties, namely, validity, effectiveness, and necessity, which is missing in all existing RLLMs. Through a set of experiments on a variety of computation problems, we empirically show that none of the existing RLLMs demonstrate systematic problem solving capabilities consistently over different problem classes and scales. Their failure modes, such as missing key solution candidates, hallucinating invalid candidates, or repeated exploration, suggest that RLLMs are wandering rather than exploring the solution space structurally.

An RLLM maps a problem to a solution, by producing a series of reasoning steps that starts from the known information and end at the goal defined by the problem specifications. Each reasoning step corresponds to a state, which represents what information has been derived from the knowns and what derivations it could do in the next step. Essentially, all the reasoning steps form a trace in the solution space, which we call an exploration.

depth-first search (DFS) on a binary tree of depth d to find any one of m designated target leaves. This task represents a problem requiring at least d binary decisions, with m valid solutions among 2d possible leaves. An RLLM that performs DFS-based systematic exploration is guaranteed to succeed.

a wandering RLLM that, at each decision point, has a probability pw of omitting one of the two child nodes – i.e., it fails to explore that branch and all its descendants, thereby risking overlooking a possible solutions.

Eq. (1) reveals that success probability drops exponentially with d for wandering RLLMs.

• Invalid Exploration: when an RLLM generates a trace that violates the validity condition –

i.e., the transitions between states do not conform to the problem’s reachability structure as defined by the function T.

• Unnecessary Exploration: when an RLLM’s trace violates the necessity property – i.e., it

contains superfluous states that do not contribute to goal discovery or dead-end elimination.

• Evaluation Error: when (1) an RLLM misinterprets its current state or its role in the

problem structure, leading to incorrect next-step move, or (2) an RLLM executes the

planned move erroneously even with correct modelling of its current state. For example, it

could make mistakes when doing calculation or information look-up.

Definition 2 (Systematic Exploration). An exploration is said to be systematic if its trace J satisfies the following three properties: (a) validity: J must follow the reachability structure defined in T; (b) effectiveness: J must contain at least one goal, i.e., ∃sji ∈ J such that sji ∈ G; and (c) necessity: every state sji ∈ J must be necessary. A state sji is necessary if for all subsequences J′ ⊆ J containing sji , removing J′ from J makes the remaining trace J \ J′ either invalid or contains fewer goal or dead-end states than J.