Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Note: this comment was later shown to be AI-generated, and its mathematics has been found to be flawed.
Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit “accuracy collapse” on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) Tower of Hanoi experiments systematically exceed model output token limits at reported failure points, with models explicitly acknowledging these constraints in their outputs; (2) The authors’ automated evaluation framework fails to distinguish between reasoning failures and practical constraints, leading to misclassification of model capabilities; (3) Most concerningly, their River Crossing benchmarks include mathematically impossible instances for N ≥ 6 due to insufficient boat capacity, yet models are scored as failures for not solving these unsolvable problems.
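To make the token-budget point concrete, the sketch below estimates the output required to enumerate a full Tower of Hanoi solution. The per-move token cost and the output budget used here are illustrative assumptions, not figures taken from either paper.

    # Back-of-the-envelope estimate of the output tokens needed to print a full
    # Tower of Hanoi move list. TOKENS_PER_MOVE and OUTPUT_BUDGET are illustrative
    # assumptions, not values from either paper.
    TOKENS_PER_MOVE = 10       # e.g. "move disk 4 from peg 1 to peg 3", plus formatting
    OUTPUT_BUDGET = 64_000     # a typical maximum-output-token limit

    for n in range(5, 16):
        moves = 2 ** n - 1                  # minimal solution length for n disks
        tokens = moves * TOKENS_PER_MOVE
        status = "exceeds budget" if tokens > OUTPUT_BUDGET else "fits"
        print(f"N={n:2d}: {moves:6d} moves, ~{tokens:7d} tokens ({status})")

Under these assumptions, full enumeration overruns the output budget well before the largest instances tested, independent of whether the model knows the algorithm.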
A critical observation overlooked in the original study: models actively recognize when they approach output limits. A recent replication by @scaling01 on Twitter [2] captured model outputs explicitly stating “The pattern continues, but to avoid making this too long, I’ll stop here” when solving Tower of Hanoi problems. This demonstrates that models understand the solution pattern but choose to truncate output due to practical constraints.
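One way to see the distinction is that the entire move sequence admits a constant-size description. The recursive generator below is a standard textbook formulation (not output from any model in the study); a model that can produce or verify such a procedure has arguably grasped the pattern even if it declines to enumerate all 2^N − 1 moves.

    def hanoi(n, source="A", target="C", spare="B"):
        """Yield the optimal move sequence for n disks as (disk, from_peg, to_peg)."""
        if n == 0:
            return
        yield from hanoi(n - 1, source, spare, target)
        yield (n, source, target)
        yield from hanoi(n - 1, spare, target, source)

    # The generator is a few lines of text, while the sequence it describes grows
    # exponentially: for 15 disks it yields 2**15 - 1 = 32767 moves.
    assert sum(1 for _ in hanoi(15)) == 2 ** 15 - 1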
This mischaracterization of model behavior as “reasoning collapse” reflects a broader issue with automated evaluation systems that fail to account for model awareness and decision-making. When evaluation frameworks cannot distinguish between “cannot solve” and “choose not to enumerate exhaustively,” they risk drawing incorrect conclusions about fundamental capabilities.
Shojaee et al.’s results demonstrate that models cannot output more tokens than their context limits allow, that programmatic evaluation can miss both model capabilities and puzzle impossibilities, and that solution length poorly predicts problem difficulty. These are valuable engineering insights, but they do not support claims about fundamental reasoning limitations.
Future work should:
Design evaluations that distinguish between reasoning capability and output constraints
Verify puzzle solvability before evaluating model performance (see the search sketch after this list)
Use complexity metrics that reflect computational difficulty, not just solution length
Consider multiple solution representations to separate algorithmic understanding from execution
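As an example of the solvability check suggested above, the sketch below runs a breadth-first search over the actor/agent River Crossing state space. The safety constraint encoded here is our reading of the standard “jealous couples” formulation and may differ in detail from the benchmark’s exact rules; it illustrates the verification step rather than reproducing the original evaluator.

    from collections import deque
    from itertools import combinations

    def safe(group):
        """No actor may be with another agent unless her own agent is also present."""
        actors = {i for kind, i in group if kind == "actor"}
        agents = {i for kind, i in group if kind == "agent"}
        return all(i in agents or not agents for i in actors)

    def solvable(n, capacity):
        """Breadth-first search: can n actor/agent pairs cross with this boat capacity?"""
        people = (frozenset(("actor", i) for i in range(n))
                  | frozenset(("agent", i) for i in range(n)))
        start = (people, "left")           # everyone on the left bank, boat on the left
        seen = {start}
        queue = deque([start])
        while queue:
            left, side = queue.popleft()
            if not left:
                return True                # everyone has reached the right bank
            here = left if side == "left" else people - left
            for k in range(1, capacity + 1):
                for group in combinations(here, k):
                    group = frozenset(group)
                    new_left = left - group if side == "left" else left | group
                    new_right = people - new_left
                    if safe(group) and safe(new_left) and safe(new_right):
                        state = (new_left, "right" if side == "left" else "left")
                        if state not in seen:
                            seen.add(state)
                            queue.append(state)
        return False

    # With boat capacity 3, five pairs can cross but six cannot.
    print(solvable(5, 3), solvable(6, 3))  # expected: True False

Running such a check before scoring would have flagged the impossible River Crossing instances noted above, rather than counting them against the models.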
The question isn’t whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing.