Do reasoning failures stem from strategy or from calculation breakdown?
This note examines an apparent either/or: do reasoning failures come from bad strategy (poor exploration, premature path-switching) or from breaking down on the mechanical step-by-step work (execution, calculation)? The corpus suggests the dichotomy itself is the wrong frame.
Read literally, the question asks you to pick a side: do models fail because they choose badly (strategy) or because they can't carry out the steps (calculation)? The corpus has strong, conflicting evidence for *both*, which is the interesting part. One camp locates failure in execution. Models often *know* the algorithm but cannot run it across many steps in text-only generation; give them a tool and they sail past the supposed 'reasoning cliff,' which suggests the bottleneck was procedural bandwidth, not thinking (Are reasoning model collapses really failures of reasoning?). The other camp locates failure in strategy: reasoning models 'wander' through invalid exploration and abandon promising paths too early ('underthinking'). Part of this can be fixed at decoding time with a thought-switching penalty, with no retraining, which suggests viable solutions existed but were discarded by bad search (Why do reasoning models abandon promising solution paths?; Why do reasoning LLMs fail at deeper problem solving?).
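A decoding-time thought-switching penalty can be sketched as a simple logit adjustment. This is a minimal illustration, not the cited work's implementation: the function names, the token strings, the `penalty` strength, and the `window` length are all hypothetical. The idea is only that tokens which typically open a new line of thought are made less likely while the current thought is still young.

```python
from typing import Dict, Set


def penalize_thought_switch(
    logits: Dict[str, float],
    switch_tokens: Set[str],
    tokens_in_current_thought: int,
    penalty: float = 3.0,   # hypothetical strength
    window: int = 600,      # hypothetical number of tokens to protect
) -> Dict[str, float]:
    """Return a copy of `logits` with switch tokens penalized early in a thought."""
    if tokens_in_current_thought >= window:
        # The current thought has had room to develop; leave logits untouched.
        return dict(logits)
    return {
        tok: (score - penalty if tok in switch_tokens else score)
        for tok, score in logits.items()
    }


def greedy_pick(logits: Dict[str, float]) -> str:
    """Pick the highest-scoring token (greedy decoding)."""
    return max(logits, key=logits.get)


# Toy next-token scores: the switch token narrowly leads before any penalty.
logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switch = {"Alternatively"}

early = greedy_pick(penalize_thought_switch(logits, switch, tokens_in_current_thought=40))
late = greedy_pick(penalize_thought_switch(logits, switch, tokens_in_current_thought=900))
```

In this toy case the model keeps going with "so" when the thought is only 40 tokens old, and is free to switch with "Alternatively" once the thought has run past the window. No weights change, which is the point of a decoding-level intervention.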
But several notes dissolve the strategy-vs-calculation split into a third thing entirely: *familiarity*. Models don't break at a complexity threshold or a calculation limit; they break at the edge of what they've seen. Reasoning chains succeed whenever the instance resembles training data, regardless of length, because the model is fitting instance-level patterns rather than running a general algorithm (Do language models fail at reasoning due to complexity or novelty?). Trace length tells the same story from another angle: in-distribution it correlates with problem difficulty, but out-of-distribution the two decouple entirely, so length mostly tracks proximity to the training distribution (Does longer reasoning actually mean harder problems?). If you take this seriously, both 'strategy' and 'calculation' are surface symptoms: the model is recalling schemas, and when recall misses, it *looks* like a strategy failure on some problems and a calculation failure on others.
That reframing has teeth because of what chain-of-thought (CoT) actually is. If CoT is constrained imitation, pattern-matching the *shape* of reasoning rather than performing inference, then structural coherence can stay intact while content quietly goes wrong, which is exactly why a trace can read as fluent strategy while the calculation underneath is hollow (Why does chain-of-thought reasoning fail in predictable ways?). This is also why *where you look* changes the diagnosis. Scoring only the final answer hides the failure; checking intermediate states reveals that most breakdowns are process violations, and adding intermediate verification lifted task success from 32% to 87% (Where do reasoning agents actually fail during long traces?). The strategy/calculation question is partly an artifact of measurement granularity.
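The difference between scoring the final answer and checking intermediate states can be made concrete with a toy arithmetic trace. This is a sketch under stated assumptions: the `Step` format, `first_violation`, and `final_answer_ok` are hypothetical names invented for illustration, and real agent traces would need a domain-specific checker.

```python
from typing import List, Optional, Tuple

# (operator, operand, running total claimed by the trace)
Step = Tuple[str, int, int]


def first_violation(start: int, steps: List[Step]) -> Optional[int]:
    """Recompute each state; return the index of the first bad step, else None."""
    total = start
    for i, (op, operand, claimed) in enumerate(steps):
        if op == "+":
            total += operand
        elif op == "*":
            total *= operand
        else:
            return i  # an unknown operation counts as a process violation
        if claimed != total:
            return i
    return None


def final_answer_ok(steps: List[Step], expected: int) -> bool:
    """Outcome-only scoring: look at the last claimed value and nothing else."""
    return bool(steps) and steps[-1][2] == expected


# Start at 2. Step 1 claims 5 * 4 = 24 (wrong, it is 20), yet the trace
# still ends on the right number because 20 + 4 = 24.
trace = [("+", 3, 5), ("*", 4, 24), ("+", 4, 24)]

outcome_passes = final_answer_ok(trace, expected=24)
bad_step = first_violation(2, trace)
```

Here the outcome-only check passes while the step-level check flags index 1, which is the sense in which measurement granularity changes the diagnosis.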
A quieter finding cuts across all of it: more reasoning is not more reliable reasoning. Accuracy follows an inverted U: it peaks at intermediate length and then *declines*, with one benchmark dropping from 87.3% to 70.3% as thinking tokens ballooned from ~1,100 to ~16K (Does more thinking time always improve reasoning accuracy?; Why does chain of thought accuracy eventually decline with length?). Longer chains create more 'corruption surfaces' (Where exactly do reasoning models fail and break?). So a failure that looks like a calculation slip late in a long trace may really be a strategy failure upstream: choosing to think too long. And whether thinking helps at all is mediated by training: vanilla models use extended thinking to spiral into self-doubt, while RL training redirects the same mechanism into productive gap analysis (Does extended thinking help or hurt model reasoning?).
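One way to see why longer chains expose more 'corruption surfaces' is a toy compounding model. This is an illustrative assumption, not a result from the notes: it treats every step as independently correct with the same probability, and `chain_success` is a hypothetical helper. It captures only the declining half of the inverted U, not the initial gains from thinking longer.

```python
def chain_success(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step is correct, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# Same per-step reliability, very different chain lengths.
short_chain = chain_success(0.99, 10)    # roughly 0.90
long_chain = chain_success(0.99, 200)    # roughly 0.13
```

Even at 99% per-step accuracy, the 200-step chain rarely survives intact, so a late 'calculation slip' can be the predictable cost of an upstream decision to keep reasoning.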
The less obvious takeaway: the most promising fixes don't choose between strategy and calculation; they restructure the search so the two can't compound. Training abstraction generators alongside solution generators produces structured breadth-first exploration, spending test-time compute on *diverse* approaches rather than drilling deeper into one, which directly prevents the underthinking trap (Can abstractions guide exploration better than depth alone?). The frontier answer to 'strategy or calculation?' is 'neither, in isolation: fix the exploration structure and the familiarity gap, and both failure modes shrink together.'
Sources (12 notes)
Are reasoning model collapses really failures of reasoning? Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Why do reasoning models abandon promising solution paths? Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Why do reasoning LLMs fail at deeper problem solving? Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
Do language models fail at reasoning due to complexity or novelty? Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Does longer reasoning actually mean harder problems? Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Why does chain-of-thought reasoning fail in predictable ways? CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
Where do reasoning agents actually fail during long traces? Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
Does more thinking time always improve reasoning accuracy? Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Why does chain of thought accuracy eventually decline with length? Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Where exactly do reasoning models fail and break? Research reveals four core failure modes: exploration wandering rather than systematic search, premature thought switching, poor hybrid reasoning mode selection, and surprising deficits in social cognition despite excelling at formal tasks. Longer reasoning chains create more corruption surfaces.
Does extended thinking help or hurt model reasoning? Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Can abstractions guide exploration better than depth alone? RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.