Why do language models fail at planning despite understanding strategies?
This explores the gap between a model that can describe a good strategy and one that actually executes it — why 'knowing the plan' and 'following the plan' come apart in LLMs.
Language models can articulate a sound strategy yet fail to carry it out, and the corpus suggests the failure isn't a knowledge gap at all but a structural disconnect between the part of the model that explains and the part that acts. The sharpest evidence is a measured split: models produce correct rationales about 87% of the time but follow their own reasoning only 64% of the time [Why do language models fail to act on their own reasoning?]. That same 87/64 fingerprint shows up framed as a 'computational split-brain', with instruction and execution running on dissociated pathways [Can language models understand without actually executing correctly?], and as 'Potemkin understanding', where a model can explain a concept, fail to apply it, and even recognize its own failure, a triple pattern that doesn't look like anything in human cognition [Can LLMs understand concepts they cannot apply?]. So 'understanding the strategy' and 'planning' turn out to be different competencies, not two ends of a single one.
The knowing-doing gap travels with two other named failure modes. Models act greedily, grabbing the locally attractive move instead of the one their reasoning endorsed, and they lean on frequency bias [Why do language models fail to act on their own reasoning?]. Planning is precisely where greed is fatal: it requires holding a longer arc and resisting the immediate payoff. You can watch this happen in real time in multi-turn settings, where models lock onto a premature assumption early and never recover, producing a 39% average performance drop of which agentic mitigations recover only 15-20% [Why do language models fail in gradually revealed conversations?]. A plan that commits to the wrong branch on turn two isn't a plan; it's a guess wearing a plan's clothing.
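The difference between a greedy move and a planned one can be made concrete with a toy two-step task. This is a minimal sketch under stated assumptions: the `TREE` payoffs and the function names are hypothetical and chosen only for illustration; they are not taken from any of the cited notes.

```python
# Hypothetical toy task: each first move yields an immediate payoff and
# unlocks a set of possible follow-up payoffs (the agent picks the best one).
TREE = {
    "grab": {"immediate": 5, "followups": [1, 2]},
    "wait": {"immediate": 1, "followups": [3, 9]},
}


def total_payoff(tree, move):
    """Immediate payoff plus the best follow-up payoff reachable after `move`."""
    return tree[move]["immediate"] + max(tree[move]["followups"])


def greedy_choice(tree):
    """Pick the first move with the largest immediate payoff only."""
    return max(tree, key=lambda move: tree[move]["immediate"])


def lookahead_choice(tree):
    """Pick the first move with the largest payoff over both steps."""
    return max(tree, key=lambda move: total_payoff(tree, move))


greedy = greedy_choice(TREE)
planned = lookahead_choice(TREE)
print(greedy, total_payoff(TREE, greedy))    # grab 7
print(planned, total_payoff(TREE, planned))  # wait 10
```

The greedy policy takes the larger first payoff and ends with less overall, which is the pattern described above: the locally attractive move wins even when the longer arc is plainly better.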
Planning also depends on exploration (trying things, tracking what happened, and updating), and that machinery is weak. In multi-armed bandit tasks, only GPT-4 given explicit exploratory hints, external history summarization, and chain-of-thought reasoning explored satisfactorily; left to track their own unstructured interaction history, models can't aggregate it into good decisions [Why do LLMs struggle with exploration in simple decision tasks?]. Part of the reason is that in-context information loses to baked-in priors: when training associations are strong, the model overrides what's actually in front of it, and text prompting alone can't fix it [Why do language models ignore information in their context?]. A planner that lets parametric defaults override the current situation will keep re-solving the average case instead of this case.
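One way to picture "external history summarization" is code outside the model that condenses the raw interaction log into per-arm statistics before it reaches the prompt. The sketch below is an assumption about what that could look like: the log format, function names, and summary wording are hypothetical and are not the setup used in the cited experiments.

```python
from collections import defaultdict


def summarize_history(history):
    """Aggregate raw (arm, reward) pairs into per-arm pull counts and mean rewards."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for arm, reward in history:
        totals[arm] += reward
        counts[arm] += 1
    return {
        arm: {"pulls": counts[arm], "mean_reward": totals[arm] / counts[arm]}
        for arm in sorted(counts)
    }


def format_summary(summary, all_arms):
    """Render the summary as prompt text, flagging arms that were never tried."""
    lines = []
    for arm in all_arms:
        if arm in summary:
            stats = summary[arm]
            lines.append(
                f"{arm}: {stats['pulls']} pulls, mean reward {stats['mean_reward']:.2f}"
            )
        else:
            lines.append(f"{arm}: 0 pulls, untried")
    return "\n".join(lines)


# Hypothetical raw log: the unstructured history a model would otherwise
# have to aggregate on its own.
history = [("A", 1), ("B", 0), ("A", 0), ("A", 1), ("B", 1)]
summary = summarize_history(history)
prompt_text = format_summary(summary, all_arms=["A", "B", "C"])
print(prompt_text)
```

The design choice worth noting is that the aggregation, including the explicit "untried" flag that makes unexplored arms visible, happens in ordinary code rather than being left to the model's reading of a long transcript.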
There's a deeper reason the 'understanding' is shakier than it looks. Reasoning success tracks instance familiarity, not strategy: models fit instance-based patterns rather than general algorithms, so a chain succeeds when it resembles training instances and breaks at novelty boundaries regardless of length [Do language models fail at reasoning due to complexity or novelty?]. Once models are viewed as autoregressive probability machines, their failures even become predictable: tasks with low-probability target outputs are systematically harder even when logically trivial [Can we predict where language models will fail?]. Planning is novel-instance generation almost by definition, which is why fluent strategy-talk doesn't transfer. Even 'strategic reasoning' is less a general faculty than a set of styles bound to game structure: different models default to minimax, trust, or belief-anticipation, and performance tracks the game type rather than raw reasoning depth [Do large language models use one reasoning style or many?].
What closes the gap, then, isn't more eloquent reasoning; it's pressure on execution. Reinforcement learning can narrow the knowing-doing gap directly [Why do language models fail to act on their own reasoning?], and the broader lesson from agent training is that capability comes from scaling the *environment* (complexity, diversity, and real-world fidelity together), not the model; weakness in any one dimension collapses generalization [What blocks scaling from language models to autonomous agents?]. The thing you didn't know you wanted to know: the models that talk the best strategy aren't necessarily the ones learning to plan, because explanation is trained on cheap text while planning only gets built when something forces the model to live with the consequences of its own moves.
Sources (10 notes)
LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
Across multi-armed bandit environments, only GPT-4 with explicit exploratory hints, external history summarization, and chain-of-thought reasoning achieves satisfactory exploration. Without external summarization, models cannot reliably track and aggregate unstructured interaction history to guide exploratory decisions.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.
Nex-N1 shows that autonomous agent performance depends on environment scaling along complexity, diversity, and real-world fidelity — not model size. Deficits in any single dimension collapse generalization, but scaling all three together enables frontier performance.