The Illusion of the Illusion of the Illusion of Thinking

Paper · Source
Flaws · Reasoning Critiques

"Shojaee et al.’s underlying observations hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity, step-by-step execution.

The true illusion is the belief that any single evaluation paradigm can definitively distinguish between reasoning, knowledge retrieval, and pattern execution.

"inadvertently shift the goalposts of what is being measured.... the truth is more nuanced than either “fundamental failure” or “simple artifact.” ...while the dramatic “collapse” is indeed an illusion, the original experiments, when viewed through the lens of the critique, still point toward genuine and important limitations in the execution of complex, sequential tasks."

Solution length (Shojaee et al.’s primary metric for complexity) is not equivalent to computational difficulty.
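This distinction is easy to make concrete with Tower of Hanoi, one of the puzzles used in the original experiments. A minimal sketch (the function name `hanoi_moves` is illustrative, not from either paper): the optimal solution grows as 2^n − 1 moves, yet the logic that produces it is a constant-size recursion, so a long solution does not imply a computationally hard problem.

```python
# Solution length vs. computational difficulty, illustrated with Tower of Hanoi.
# The optimal move list grows exponentially (2^n - 1 moves for n disks), but
# the program generating it is a trivial three-line recursion.

def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)   # park n-1 disks on the spare peg
            + [(src, dst)]                       # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst)) # restack n-1 disks on top of it

# Exponential output from constant-size logic:
assert len(hanoi_moves(3)) == 2**3 - 1    # 7 moves
assert len(hanoi_moves(10)) == 2**10 - 1  # 1023 moves
```

Enumerating every move of a 15-disk instance token by token is expensive purely because the transcript is long, not because any single step is hard.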

  1. The Fragility of Sustained Execution: The fact that models fail at high-iteration sequential tasks, even when the underlying logic is simple (as in Tower of Hanoi), points to a weakness in sustained, step-by-step processing. While the hard token limit is the ultimate cause of failure in the experiment, the enormous token cost itself is a symptom of how LLMs represent and execute such problems. A system with more robust internal state-tracking might execute the steps more efficiently.

  2. Unexplained Behavioral Tics: Opus and Lawsen’s critique does not fully account for one of Shojaee et al.’s most intriguing findings: that near the collapse point, LRMs “begin reducing their reasoning effort (measured by inference-time tokens)”. If the issue were merely hitting a hard output limit, one might expect models to reason consistently until that limit is reached. This counter-intuitive decline in effort on harder problems, also noted in other contexts [3], suggests a more complex behavioral scaling property that warrants further investigation.

  3. Data Contamination and Generalization: Shojaee et al. observed that models could handle a 100+ move Tower of Hanoi problem but failed a far shorter River Crossing problem. They speculate this is due to the prevalence of the former in training data. This highlights a key challenge in evaluation: distinguishing true, generalizable reasoning from sophisticated pattern matching of familiar problems, a core issue in compositionality [4]. Opus and Lawsen’s “generate a function” test, when applied to a very common problem, falls into this same ambiguity.
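The “generate a function” paradigm can still be graded mechanically rather than by inspecting a transcript. A sketch of that style of evaluation (all names here are illustrative, not taken from either paper): instead of scoring a model’s enumerated move list, execute the program it submits and verify the result with a simulator.

```python
# Sketch of a "generate a function" evaluation: run a submitted program and
# check its output against the puzzle rules, instead of grading a long
# enumerated transcript.

def verify_hanoi(moves, n):
    """Simulate a move list on n disks; return True iff it legally solves the puzzle."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # peg tops are list ends
    for src, dst in moves:
        if not pegs[src]:
            return False                      # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # illegal: larger disk onto smaller
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # solved iff all disks end on C

# A candidate program, standing in for model-generated code:
def hanoi(n, src="A", aux="B", dst="C"):
    if n == 0:
        return []
    return hanoi(n - 1, src, dst, aux) + [(src, dst)] + hanoi(n - 1, aux, src, dst)

assert verify_hanoi(hanoi(8), 8)   # a correct program passes
assert not verify_hanoi([], 8)     # an empty move list fails
```

The catch noted above remains, though: a passing `hanoi` function for so canonical a puzzle cannot distinguish derived reasoning from a memorized textbook solution.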