What explains the 87 percent to 12 percent cliff in plan executability?

This explores the gap between what LLMs can explain or know about a task (high marks — around 87%) and what they can actually execute as a working plan (a steep drop — as low as 12%), and what the corpus says causes that collapse.

This reads the question as the gap between knowing and doing: models that score about 87% when articulating principles or planning knowledge fall to about 12% when those plans have to actually run. The two headline numbers come from different notes, but the corpus is unusually direct about the cause, and it isn't a knowledge deficit. The cleanest statement of the mechanism is what one study calls a 'computational split-brain': models articulate correct principles at 87% but apply them in action at only 64%, because the pathway that understands instructions is structurally dissociated from the pathway that executes them (Can language models understand without actually executing correctly?). The 12% figure comes from the planning side of the same coin: GPT-4 readily produces planning *knowledge*, but only 12% of its plans run without error, because executable planning requires assembling reasoning about how subgoals and resources interact, not just reciting what a good plan looks like (Can large language models actually create executable plans?).
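To make the knowing/doing distinction concrete, here is a minimal sketch of what "executable" can mean. It assumes a toy plan format invented purely for illustration: the `Step` type, the resource names, and `first_failure` are hypothetical and are not taken from the cited studies. Both plans contain the same correct steps; only the one whose ordering respects how subgoals produce and consume resources will run.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    needs: dict = field(default_factory=dict)   # resources consumed by the step
    makes: dict = field(default_factory=dict)   # resources produced by the step


def first_failure(plan, inventory):
    """Simulate the plan; return the name of the first step that cannot run, or None."""
    stock = dict(inventory)
    for step in plan:
        # A step is runnable only if every resource it needs is in stock.
        for res, qty in step.needs.items():
            if stock.get(res, 0) < qty:
                return step.name
        for res, qty in step.needs.items():
            stock[res] = stock.get(res, 0) - qty
        for res, qty in step.makes.items():
            stock[res] = stock.get(res, 0) + qty
    return None


# Hypothetical toy domain: two chops are needed before the table can be built.
chop = Step("chop wood", needs={"log": 1}, makes={"plank": 2})
build = Step("build table", needs={"plank": 4}, makes={"table": 1})

good_plan = [chop, chop, build]
bad_plan = [chop, build, chop]   # same steps, wrong order

print(first_failure(good_plan, {"log": 2}))
print(first_failure(bad_plan, {"log": 2}))
```

The second plan "knows" every step a table needs, yet it fails at `build table` because only two planks exist at that point. That is the kind of interaction a plan has to get right to count as executable in this toy setting.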

Why does competence collapse exactly where knowledge stays high? Several notes point to error accumulation across steps. A shift-cipher decomposition of chain-of-thought found that genuine reasoning does exist in the model, but it accumulates error with every step, so a trace that's locally plausible drifts further from correct the longer it runs (What three separate factors drive chain-of-thought performance?). That's the difference between explaining (one shot, no accumulation) and executing (many dependent steps, each a chance to derail). A related critique frames CoT as constrained imitation rather than abstract inference: the model pattern-matches the *shape* of reasoning, which looks competent until the structure has to bear real execution weight (Why does chain-of-thought reasoning fail in predictable ways?).
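The accumulation argument can be sketched with a few lines of arithmetic. This is an idealization rather than a result from the cited notes: it assumes every step succeeds independently with the same probability, and the 0.9 per-step figure is an arbitrary illustrative value, not a measured one.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` dependent steps are correct,
    assuming independent, identically likely per-step errors."""
    return per_step_accuracy ** steps


# Illustrative only: 0.9 is an assumed per-step accuracy.
for n in (1, 5, 10):
    print(n, round(chain_success(0.9, n), 3))
```

Under these assumptions a one-step task succeeds 90% of the time, while a ten-step task succeeds only about 35% of the time. The step that looks reliable in isolation becomes unreliable once many of them have to hold together.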

There's also an architectural floor underneath the cliff. Autoregressive generation can't retract a token once emitted, but real execution (constraint satisfaction, planning under resource limits) depends on discarding bad partial assignments. So the model can *describe* a valid solution while being structurally unable to *search* for one, which is why LLMs plateau around 55–60% constraint satisfaction regardless of scale (Why does autoregressive generation fail at constraint satisfaction?; Do larger language models solve constrained optimization better?). Tellingly, this is not fixed by 'thinking harder': reasoning models with extended chains-of-thought produce more text, not more iterative computation, and don't systematically beat standard models on these execution-bound tasks (Do reasoning models actually beat standard models on optimization?).
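The role of retraction can be illustrated with a classic constraint problem. The sketch below is an analogy, not a model of how an LLM decodes: it uses 4-queens as a stand-in constraint task (an assumption of this example), and compares a left-to-right procedure that never revises a choice with a backtracking search that discards invalid partial assignments.

```python
def safe(cols, col):
    """True if a queen at (len(cols), col) attacks none of the queens already placed."""
    row = len(cols)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(cols))


def commit_only(n):
    """Place one queen per row, top to bottom, never revising an earlier choice."""
    cols = []
    for _ in range(n):
        choice = next((c for c in range(n) if safe(cols, c)), None)
        if choice is None:
            return None          # stuck: there is no way to undo an earlier placement
        cols.append(choice)
    return cols


def backtracking(n, cols=()):
    """Same placement rule, but a dead end discards the partial assignment."""
    if len(cols) == n:
        return list(cols)
    for c in range(n):
        if safe(cols, c):
            found = backtracking(n, cols + (c,))
            if found is not None:
                return found
    return None


print(commit_only(4))    # None
print(backtracking(4))   # [1, 3, 0, 2]
```

Both procedures apply the same local rule and both can check whether a finished board is valid. Only the one that can throw away its first placement reaches a solution, which is the primitive the note argues autoregressive generation lacks.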

The most interesting turn, and perhaps the least expected, is that the cliff is partly an artifact of *where we look for errors*, and it can be largely climbed back. When reliability is measured by scoring only the final answer, execution failures look like a wall. But when you verify the intermediate states and check policy compliance *during* generation, task success jumps from 32% to 87% (a separate measurement from the 87% explanation score above), because most failures turn out to be process violations rather than wrong end-goals (Where do reasoning agents actually fail during long traces?). Step-level confidence catches the breakdowns that global averaging hides (Does step-level confidence outperform global averaging for trace filtering?), and architecturally, simply separating the planner from the executor, letting one model decompose and another solve, removes the interference and improves both accuracy and generalization (Does separating planning from execution improve reasoning accuracy?).
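Why a local signal can catch what a global average hides is easy to show on made-up numbers. The sketch below is a guess at the general idea, not the cited note's method: the confidence values, the window size of 3, the 0.7 threshold, and both function names are assumptions chosen for illustration.

```python
from statistics import mean


def global_confidence(step_conf):
    """Average confidence over the whole trace."""
    return mean(step_conf)


def lowest_window_confidence(step_conf, window=3):
    """Mean confidence of the weakest run of `window` consecutive steps."""
    if len(step_conf) <= window:
        return mean(step_conf)
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )


# Invented traces: one uniformly decent, one confident except for a mid-trace collapse.
steady = [0.8] * 10
breaks = [0.95, 0.95, 0.95, 0.95, 0.3, 0.3, 0.3, 0.95, 0.95, 0.95]

THRESHOLD = 0.7  # hypothetical acceptance cutoff
for name, trace in (("steady", steady), ("breaks", breaks)):
    print(
        name,
        global_confidence(trace) >= THRESHOLD,
        lowest_window_confidence(trace) >= THRESHOLD,
    )
```

On these numbers both traces pass the global-average check, but only the steady one passes the windowed check. Because the windowed statistic is available as soon as the weak stretch appears, a filter built on it could also stop a trace early instead of waiting for the final answer.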

So the 87-to-12 cliff is best read not as 'the model doesn't know the task' but as three stacked facts: knowledge and execution live in dissociated pathways, errors compound across dependent execution steps, and the autoregressive architecture lacks the retraction primitive that real execution needs. The encouraging corollary is that intervening *in the process* (verifying mid-trace, filtering by step confidence, splitting planning from solving) recovers much of the lost ground that scaling and 'more reasoning' do not.

Sources 10 notes

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Can large language models actually create executable plans?

Only 12% of GPT-4 generated plans are actually executable without errors. LLMs excel at acquiring planning knowledge but fail at the reasoning assembly required to handle subgoal and resource interactions.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
