A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
The recent work by Shojaee et al. (2025), titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, presents a compelling empirical finding: a “reasoning cliff” at which the performance of Large Reasoning Models (LRMs) collapses beyond a specific complexity threshold, which the authors posit as an intrinsic scaling limitation of Chain-of-Thought (CoT) reasoning. This commentary, while acknowledging the study's methodological rigor, contends that this conclusion is confounded by experimental artifacts. We argue that the observed failure is not evidence of a fundamental cognitive boundary, but rather a predictable outcome of system-level constraints in the static, text-only evaluation paradigm, including tool-use restrictions, context-window recall issues, the absence of crucial cognitive baselines, inadequate statistical reporting, and output-generation limits. We reframe this performance collapse through the lens of an “agentic gap”: the models are not failing at reasoning, but at execution within a profoundly restrictive interface. We substantiate this critique empirically by demonstrating a striking reversal: a model that declares a puzzle impossible when confined to text-only generation, once granted agentic tools, not only solves it but masters variants of complexity far beyond the reasoning cliff it previously failed to surmount. Additionally, our empirical analysis of tool-enabled models such as o4-mini and GPT-4o reveals a hierarchy of agentic reasoning, from simple procedural execution to complex meta-cognitive self-correction, with significant implications for how we define and measure machine intelligence. The “illusion of thinking” attributed to LRMs is less a reasoning deficit and more the consequence of an otherwise capable mind lacking the tools for action.
On simpler problems, LRMs often exhibit "overthinking," where they find the correct solution early in the reasoning trace but continue to waste compute by exploring incorrect alternatives. As problems become moderately more complex, this trend reverses, and correct solutions tend to emerge only later in the thought process, after extensive exploration of incorrect paths. At the highest complexities, models fail to find any correct solutions within their reasoning traces.
The most revealing finding in Shojaee et al. (2025), and the starting point for our own analysis, is one that isolates the bottleneck of LRM performance: providing models like Claude 3.7 Sonnet-Thinking with an explicit, optimal algorithm for the Tower of Hanoi does not prevent performance collapse. This suggests the bottleneck is not a lack of conceptual knowledge, but a failure of execution. The crucial missing piece of context is that the LRMs are forced to perform this execution in a profoundly restricted environment. The experimental setup forbids the LRMs from writing and executing code to solve the puzzles. This prohibition forces the LRM into the role of a "human simulator," painstakingly transcribing thousands of discrete steps, rather than allowing it to function as a "problem solver," which would naturally offload such procedural execution to a more suitable tool (Xi et al., 2025; Qu et al., 2025; Patil et al., 2023).
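To make the asymmetry concrete, consider how trivially this execution burden dissolves once it is offloaded. The following is a minimal Python sketch of the standard recursive Tower of Hanoi procedure, the same well-known algorithm the models were given in prose; it emits the optimal move sequence for any number of disks in a few lines:

```python
def hanoi(n, source, target, spare, moves):
    """Standard recursive Tower of Hanoi: emits the optimal 2**n - 1 moves."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the top n-1 disks
    moves.append((n, source, target))            # move disk n directly
    hanoi(n - 1, spare, target, source, moves)   # restack the n-1 disks

moves = []
hanoi(10, 'A', 'C', 'B', moves)
print(len(moves))  # 1023 moves, i.e. 2**10 - 1
```

A tool-enabled model can produce and run such a sketch in seconds; a text-only model must transcribe all 1,023 moves token by token without a single state-tracking slip.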
This exposes the core issue: the observed failures are not of reasoning in abstracto, but of agency. Modern LRMs are powerful cognitive engines (Camposampiero et al., 2025; El-Kishky et al., 2025; Kambhampati et al., 2025; Liu et al., 2025; Niu et al., 2024), but the static, text-only interface creates an "agentic gap" (Wang et al., 2024). However, our empirical results show that simply bridging this gap with tools is not sufficient. How a model utilizes those tools reveals a clear hierarchy of reasoning, from simple procedural execution to complex, meta-cognitive self-correction. Drawing from the lexicon of cognitive science, we frame this distinction as one between First-Order Agency, the direct application of a devised strategy to act upon the world, and Second-Order Agency, the capacity to reflect upon, evaluate, and revise one's own internal strategies and thought processes.
The results were telling. The model not only failed to produce a valid sequence of moves for any non-trivial case (e.g., N = 5 pairs with a k = 3 person boat) but, after generating extensive yet flawed reasoning traces, in some instances repeatedly and confidently concluded that these solvable problems were "logically impossible." This reveals a critical failure mode: the model's inability to perfectly maintain state and exhaust a complex search space autoregressively leads it to mistake its own executional limitations for fundamental impossibilities of the puzzle itself (mlsubmission, 2025; OpenAI, 2025a). This baseline demonstrates that the non-agentic interface described by Shojaee et al. (2025) is so restrictive that it prevents even the most basic form of problem-solving execution, forcing the model into a state analogous to learned helplessness, in which the LRM incorrectly generalizes from its inability to act to the conclusion that the task itself is unsolvable (Maier et al., 1976).
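For reference, a compact breadth-first search settles the solvability question directly. The sketch below is our illustrative reconstruction, not the code any model actually produced; it assumes the standard actor/agent safety constraint (an actor may not share a location with another pair's agent unless their own agent is present) and confirms that the N = 5, k = 3 instance the model declared impossible is in fact solvable:

```python
from collections import deque
from itertools import combinations

def safe(group):
    """A location is safe if no actor shares it with a foreign agent
    while their own agent is absent."""
    agents = {i for kind, i in group if kind == 'agent'}
    return all(not agents or i in agents
               for kind, i in group if kind == 'actor')

def solve(n_pairs, boat_capacity):
    """Breadth-first search over (left-bank occupants, boat side) states."""
    people = frozenset((kind, i) for kind in ('actor', 'agent')
                       for i in range(n_pairs))
    start = (people, 'left')
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        (left, side), path = queue.popleft()
        if not left:
            return path                      # everyone reached the right bank
        bank = left if side == 'left' else people - left
        for size in range(1, boat_capacity + 1):
            for crew in combinations(bank, size):
                crew = frozenset(crew)
                new_left = left - crew if side == 'left' else left | crew
                # the boat and both banks must all remain safe after the move
                if not (safe(crew) and safe(new_left)
                        and safe(people - new_left)):
                    continue
                state = (new_left, 'right' if side == 'left' else 'left')
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [(side, sorted(crew))]))
    return None

print(len(solve(5, 3)))  # a valid crossing exists; prints its move count
```

The full state space here is only a few thousand states, which is precisely why the failure is one of execution interface rather than problem intractability.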
In stark contrast, the o4-mini LRM demonstrates Second-Order Agency, a process that more closely resembles the deliberate, analytical "System 2" of dual-process theory (Kahneman, 2011). It, too, begins with a flawed algorithmic hypothesis (Figure 1D), suggesting that its initial intuitive approach is not infallible. However, its use of the Python tool is more sophisticated: it is not merely for execution, but for verification. The model's own simulation correctly detects the failure of its initial plan. This moment of self-generated negative feedback, a form of cognitive dissonance between its intended outcome and the actual result, triggers a meta-cognitive self-correction, a hallmark of higher-order thinking (Flavell, 1979). The LRM exhibits cognitive flexibility by overcoming its initial fixation, discarding the failed strategy, and selecting an entirely new, correct "paired-couples" algorithm (OpenAI, 2025c), which it then successfully validates (Figure 1E) before generating the final solution (Figure 1F). This iterative loop of plan, test, fail, and revise mirrors the process of deliberate practice in human expertise, where feedback is actively sought and used to refine performance (Ericsson et al., 1993). From the perspective of Newell and Simon's problem space theory (1972), GPT-4o performs a limited search and becomes trapped on an incorrect path, whereas o4-mini demonstrates a more sophisticated search, capable of backtracking from a failed state to explore a completely different and ultimately successful branch of the problem space.
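The verification step itself is mechanically simple, which is what makes its absence in text-only runs so costly. Below is a hedged sketch of the kind of self-check harness this verification amounts to; it is our illustration rather than o4-mini's actual code, it reuses the safe() helper from the previous sketch, and the move encoding is an assumption:

```python
def validate(moves, n_pairs, boat_capacity):
    """Replay a proposed plan and report the first rule violation, if any.
    Each move is (side, crew): the departure bank and the people aboard."""
    people = frozenset((kind, i) for kind in ('actor', 'agent')
                       for i in range(n_pairs))
    left, side = set(people), 'left'
    for t, (move_side, crew) in enumerate(moves):
        crew = frozenset(crew)
        if move_side != side:
            return f"move {t}: boat is on the {side} bank, not {move_side}"
        bank = left if side == 'left' else people - left
        if not 1 <= len(crew) <= boat_capacity or not crew <= bank:
            return f"move {t}: crew is empty, too large, or not on this bank"
        left = left - crew if side == 'left' else left | crew
        side = 'right' if side == 'left' else 'left'
        if not (safe(crew) and safe(left) and safe(people - left)):
            return f"move {t}: safety constraint violated"
    return "valid solution" if not left else "valid prefix, but not finished"
```

A First-Order agent generates moves and stops; a Second-Order agent runs exactly this kind of replay, treats a returned violation as evidence against its current strategy, and revises.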
The analysis tends to treat difficulty as a uniform axis controlled by 'N', but the formal properties of the puzzles suggest the models are succumbing to different cognitive pressures. The early collapse on River Crossing is not just a function of move count; it is indicative of a failure to manage the complex, global state constraints inherent to a PSPACE-complete problem. In contrast, failures in Blocks World are more likely tied to the breakdown of the strategic, non-linear planning the task requires.
The central question for the field should evolve from a binary "Can models reason?" to a more nuanced "What kind of reasoners are they, and under what conditions can they ascend the agentic hierarchy?"
• Probing the Agentic Boundary: What are the specific task properties or model characteristics that differentiate First-Order from Second-Order agency? Future work should design benchmarks that specifically target metacognitive functions like error detection, strategy-switching, and uncertainty estimation.
• Inducing Higher-Order Agency: Can models exhibiting only First-Order agency be trained to achieve Second-Order capabilities? This could involve novel training techniques, such as reinforcement learning with metacognitive rewards or fine-tuning on datasets that explicitly demonstrate self-correction and strategic adaptation; a toy sketch of such a reward appears after this list.
• Architectural Correlates of Agency: Are there specific LRM training components or scaling properties that correlate with the emergence of Second-Order agency? Understanding these links is crucial for building more reliable and capable reasoning systems.
• Implications for AI Safety and Reliability: A model limited to First-Order agency poses a distinct safety risk: it may confidently execute a flawed or harmful plan without the capacity for self-correction. Fostering Second-Order agency is therefore not just a matter of performance, but a critical step toward building more robust and trustworthy AI.
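To make the second direction above concrete, one hypothetical form a metacognitive reward could take is a shaping term that pays a bonus only for verified self-correction: a failing check, followed by a strategy switch, followed by a passing check. The sketch below is purely illustrative; the Step schema, its fields, and the bonus value are assumptions of ours, not an existing API or training recipe:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One annotated step of a reasoning trace (hypothetical schema)."""
    kind: str            # 'check' or 'strategy_switch'; other kinds ignored
    passed: bool = True  # outcome, meaningful only for 'check' steps

def metacognitive_reward(trace, task_reward, bonus=0.1):
    """Shaped reward: task success plus a bonus per verified self-correction
    (a failing check, then a strategy switch, then a passing check)."""
    total, failed, switched = task_reward, False, False
    for step in trace:
        if step.kind == 'check' and not step.passed:
            failed, switched = True, False
        elif step.kind == 'strategy_switch' and failed:
            switched = True
        elif step.kind == 'check' and step.passed and failed and switched:
            total += bonus               # reward the full fail-revise-pass arc
            failed, switched = False, False
    return total
```

The design choice is deliberate: rewarding only the complete fail-revise-pass arc, rather than strategy switches in isolation, discourages aimless thrashing and selects for exactly the plan-test-fail-revise loop that distinguishes o4-mini's behavior above.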