The Illusion of the Illusion of the Illusion of Thinking
"Shojaee et al.’s underlying observations hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity, step-by-step execution."
The true illusion is the belief that any single evaluation paradigm can definitively distinguish between reasoning, knowledge retrieval, and pattern execution.
"…inadvertently shift the goalposts of what is being measured. … The truth is more nuanced than either “fundamental failure” or “simple artifact.” … While the dramatic “collapse” is indeed an illusion, the original experiments, when viewed through the lens of the critique, still point toward genuine and important limitations in the execution of complex, sequential tasks."
Solution length (Shojaee et al.’s primary metric for complexity) is not equivalent to computational difficulty.
- The Fragility of Sustained Execution: The fact that models fail at
high-iteration sequential tasks, even when the underlying logic is simple
(like Tower of Hanoi), points to a weakness in sustained, step-by-step
processing. While the hard token limit is the ultimate cause of failure in the
experiment, the enormous token cost itself is a symptom of how LLMs represent
and execute such problems. A system with more robust internal state-tracking
might execute the steps more efficiently.
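The asymmetry that bullet describes is easy to make concrete: the Tower of Hanoi's underlying logic fits in a few lines of recursion, yet the move sequence it produces, and hence the tokens needed to enumerate it step by step, grows as 2^n − 1. A minimal sketch (the function name is illustrative, not from either paper):

```python
# Minimal sketch: the logic of Tower of Hanoi is tiny, but the move list
# a model must enumerate token-by-token grows exponentially (2^n - 1).

def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Return the full (source, destination) move list for n disks."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)   # park n-1 disks on aux
            + [(src, dst)]                      # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst))  # stack n-1 back on top

for n in (5, 10, 15):
    print(n, len(hanoi_moves(n)))  # 31, 1023, 32767 moves
```

A 15-disk instance already requires 32,767 moves, so a model serializing every step exhausts realistic token budgets long before the reasoning itself becomes hard.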
- Unexplained Behavioral Tics: Opus
and Lawsen’s critique does not fully account for one of Shojaee et al.’s most
intriguing findings: that near the collapse point, LRMs
“begin reducing their reasoning effort (measured by inference-time tokens)”. If
the issue were merely hitting a hard output limit, one might expect models to
consistently reason until that limit is reached. This counter-intuitive
decline in effort on harder problems, also noted in other contexts
[3], suggests a more complex behavioral scaling property that warrants further
investigation.
- Data Contamination and Generalization: Shojaee et al.
observed that models could handle a 100+ move Tower of Hanoi problem but
failed a far shorter River Crossing problem. They speculate this is due to the
prevalence of the former in training data. This highlights a key challenge in
evaluation: distinguishing true, generalizable reasoning from sophisticated
pattern matching of familiar problems, a core issue in compositionality
[4]. Opus and Lawsen’s “generate a function” test, when applied to a very
common problem, falls into this same ambiguity.
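One way to reduce that ambiguity is to make the evaluation independent of problem familiarity: rather than comparing a model's output against a memorized solution, mechanically verify any claimed move sequence against the puzzle's rules. A hedged sketch of such a checker for Tower of Hanoi (the function and its interface are my own illustration, not a procedure from either paper):

```python
# Hedged sketch: a rule-based verifier for Tower of Hanoi move sequences.
# Checking a claimed solution against the actual rules tests execution
# fidelity directly, whether or not the problem appeared in training data.

def is_valid_solution(n, moves, pegs=("A", "B", "C")):
    """Check that `moves` (list of (src, dst) pairs) legally solves n disks."""
    state = {p: [] for p in pegs}
    state[pegs[0]] = list(range(n, 0, -1))      # largest disk at the bottom
    for src, dst in moves:
        if not state[src]:
            return False                        # moving from an empty peg
        disk = state[src].pop()
        if state[dst] and state[dst][-1] < disk:
            return False                        # larger disk onto smaller
        state[dst].append(disk)
    return state[pegs[2]] == list(range(n, 0, -1))  # all disks on the goal peg

def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Reference generator used here only to exercise the checker."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux) + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

print(is_valid_solution(4, hanoi_moves(4)))        # True
print(is_valid_solution(2, [("A", "C"), ("A", "C")]))  # False: 2 placed on 1
```

The same idea extends to River Crossing or any puzzle with checkable rules, which is precisely what makes execution-based metrics more robust than answer matching against potentially memorized solutions.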