INQUIRING LINE

Can reasoning models succeed at logic but fail at execution?

This explores whether models that can reason correctly in the abstract still break down when they have to actually carry out the steps — separating 'knowing the method' from 'running it.'


The corpus says yes, emphatically, and several notes argue this split is the central thing we've been misreading as a 'reasoning' limit. The sharpest version is the finding that what look like reasoning collapses are really execution failures: text-only models often know the underlying algorithm but cannot run it across many steps, and the moment you hand them tools, they sail past the supposed 'reasoning cliff' Are reasoning model collapses really failures of reasoning?. A companion note names the phenomenon directly as a kind of split-brain: models articulate the right principle with ~87% accuracy but apply it correctly only ~64% of the time, a structural disconnect between the pathway that explains and the pathway that does Can language models understand without actually executing correctly?.
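To make the 'knowing the method' versus 'running it' gap concrete, here is a minimal sketch. Tower of Hanoi is an assumption chosen purely for illustration (the notes above do not name a specific puzzle), and `hanoi_moves` and `count_moves` are hypothetical helper names. The point is only that a method can be a few lines long while its faithful execution is exponentially long.

```python
def hanoi_moves(n, src="A", dst="C", via="B"):
    """Yield every move (from_peg, to_peg) that solves an n-disk puzzle."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, src, via, dst)
    # Move the largest remaining disk to its destination.
    yield (src, dst)
    # Move the n-1 smaller disks back on top of it.
    yield from hanoi_moves(n - 1, via, dst, src)


def count_moves(n):
    """Count the moves by actually generating them (2**n - 1 in total)."""
    return sum(1 for _ in hanoi_moves(n))


for n in (3, 10, 15):
    print(n, count_moves(n))  # 3 -> 7, 10 -> 1023, 15 -> 32767
```

Stating the three-line recursion is the 'explain' pathway; emitting tens of thousands of moves without a single slip is the 'execute' pathway. A tool-enabled model can delegate the second part to an interpreter, which is one plausible reading of why the supposed cliff disappears.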

Once you accept that framing, a lot of other failures look less like 'the model can't think' and more like 'the model can't keep executing cleanly.' Reasoning models wander: they explore invalid paths and abandon promising ones too early. You can claw back accuracy just by penalizing premature thought-switching at decode time, with no retraining, which means the good solution was reachable but procedurally dropped Why do reasoning models abandon promising solution paths?. The same lack of valid, effective, and necessary exploration makes success probability drop exponentially as problems get deeper, so medium puzzles are fine and deep ones are catastrophic Why do reasoning LLMs fail at deeper problem solving?. And on constraint-satisfaction problems that demand genuine backtracking, frontier models like o1-preview and DeepSeek-R1 stall at 20–23.6% exact match: fluent reflection that simply doesn't convert into sustained problem-solving Can reasoning models actually sustain long-chain reflection?.
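The depth effect can be sketched with simple arithmetic. The model below is an assumption, not something the notes specify: it treats each step as succeeding independently with the same probability, and the 0.95 figure and the function names are made up for illustration.

```python
import math


def chain_success(p_step, depth):
    """Probability that all `depth` steps succeed, assuming independent steps."""
    return p_step ** depth


def max_depth_for(p_step, target):
    """Largest depth at which chain success is still at least `target`."""
    return math.floor(math.log(target) / math.log(p_step))


# Illustrative per-step reliability of 95% (an assumed number).
for depth in (5, 20, 50):
    print(depth, round(chain_success(0.95, depth), 3))  # 0.774, 0.358, 0.077

print(max_depth_for(0.95, 0.5))  # 13
```

Under these assumptions a model that is right 95% of the time per step is still a coin flip by depth 13 and nearly hopeless by depth 50, which matches the shape of 'medium is fine, deep is catastrophic' without requiring any change in what the model knows.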

The most useful thing the corpus hands you is *where* to look for the failure: in the process, not the answer. Scoring only the final output misses the actual breakage, because most failures are process violations mid-trace — checking intermediate states and policy compliance during generation lifted task success from 32% to 87% Where do reasoning agents actually fail during long traces?. That reframes execution as something verifiable step-by-step rather than a black box you grade at the end.
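A minimal sketch of the difference between grading the answer and checking the process follows. Everything in it is hypothetical: the trace, the `never_overdrawn` policy, and the helper names are invented for illustration and do not come from the cited note.

```python
from typing import Callable, Optional, Sequence


def final_only_ok(states: Sequence[int], expected_final: int) -> bool:
    """Outcome scoring: look only at the last state of the trace."""
    return bool(states) and states[-1] == expected_final


def first_violation(
    states: Sequence[int], is_valid: Callable[[int], bool]
) -> Optional[int]:
    """Process scoring: return the index of the first state that breaks policy."""
    for i, state in enumerate(states):
        if not is_valid(state):
            return i
    return None


def never_overdrawn(balance: int) -> bool:
    """Hypothetical policy: an account balance may never go negative."""
    return balance >= 0


# Hypothetical trace: the balance after each agent action.
trace = [100, 40, -10, 60]

print(final_only_ok(trace, 60))                 # True: the answer looks right
print(first_violation(trace, never_overdrawn))  # 2: the process broke mid-trace
```

The trace ends in the expected state, so an end-of-run grader passes it, while the step-level check pinpoints the action that violated policy. Running the check during generation rather than afterwards would also allow the agent to be stopped or corrected at that step.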

There's a deeper, slightly unsettling current here too. If chain-of-thought is really pattern-matched imitation of reasoning structure rather than genuine inference Why does chain-of-thought reasoning fail in predictable ways?, and if deliberately corrupted traces teach almost as well as correct ones — because the trace works as computational scaffolding, not meaning Do reasoning traces need to be semantically correct? — then the 'logic' the model displays was never as separable from execution as it looks. Failures cluster not at complexity thresholds but at unfamiliar instances, suggesting models run memorized instance patterns rather than general algorithms Do language models fail at reasoning due to complexity or novelty?. And some apparent reasoning success is an illusion of a different kind: models can score well by conservatively defaulting to the safe option, scoring *worse* when constraints are removed, which means they were never evaluating the logic at all Are models actually reasoning about constraints or just defaulting conservatively?.

So the answer is layered. Yes — models can hold the right logic and still fail to execute it, and tool use plus process-level verification is how you recover that lost competence. But the corpus also pushes back on the clean dichotomy: in some cases the 'logic' was scaffolding or conservative defaulting all along, so the more honest question becomes how much of displayed reasoning was ever doing the work we credited it for.


Sources (10 notes)

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
