Can large language models actually create executable plans?
Do LLMs genuinely assemble plans that work, or just generate planning-domain knowledge that sounds coherent? Understanding this distinction matters for deploying AI in real planning tasks.
Solving planning tasks requires two distinct capabilities: (a) having planning domain knowledge — actions, preconditions, effects, hierarchical recipes, past cases — and (b) assembling that knowledge into an executable plan that handles subgoal and resource interactions. LLMs are strong at (a) and fail at (b). Only about 12% of the plans GPT-4 generates execute without errors and reach the goal.
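The distinction can be made concrete. Below is a minimal sketch of plan validation in a STRIPS-style setting (the domain, action schemas, and function names are illustrative, not taken from the papers): executability means every action's preconditions hold in the simulated state and the final state entails the goal, which is what external verifiers such as VAL check and what fluent-sounding LLM output routinely fails.

```python
# STRIPS-style plan validation sketch (illustrative names, not from the
# cited papers). "Knowledge" is the action schemas; "executability" is
# whether a proposed action sequence actually simulates through.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # facts that must hold before the action
    add_effects: frozenset    # facts made true by the action
    del_effects: frozenset    # facts made false by the action

def validate_plan(initial_state: frozenset, goal: frozenset,
                  plan: list[Action]) -> bool:
    """Simulate the plan step by step against the evolving state."""
    state = set(initial_state)
    for action in plan:
        if not action.preconditions <= state:
            return False          # precondition violated: not executable
        state -= action.del_effects
        state |= action.add_effects
    return goal <= state          # executable AND goal-reaching

# A model can "know" these schemas (capability a) yet still emit a
# sequence that fails validate_plan (capability b).
pickup_a = Action("pickup(A)",
                  frozenset({"clear(A)", "ontable(A)", "handempty"}),
                  frozenset({"holding(A)"}),
                  frozenset({"clear(A)", "ontable(A)", "handempty"}))
stack_a_b = Action("stack(A,B)",
                   frozenset({"holding(A)", "clear(B)"}),
                   frozenset({"on(A,B)", "clear(A)", "handempty"}),
                   frozenset({"holding(A)", "clear(B)"}))

init = frozenset({"clear(A)", "clear(B)", "ontable(A)", "ontable(B)",
                  "handempty"})
goal = frozenset({"on(A,B)"})
print(validate_plan(init, goal, [pickup_a, stack_a_b]))  # True
print(validate_plan(init, goal, [stack_a_b, pickup_a]))  # False
```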
The confusion between these two capabilities explains much of the conflicting literature on LLM planning. "Many papers claiming planning abilities of LLMs, on closer examination, wind up confusing general planning knowledge extracted from the LLMs for executable plans. When all we are looking for are abstract plans, such as 'wedding plans,' with no intention of actually executing said plans directly, it is easy to confuse them for complete executable plans."
Self-critiquing makes things worse, not better. LLMs "hallucinate both false positives and false negatives while verifying the solutions they generate." With self-verification, performance actually diminishes compared to systems with external sound verifiers. The nature of feedback — whether binary or detailed — shows minimal impact on generation, suggesting "the core issue lies in the LLM's binary verification capabilities rather than the granularity of feedback." As "Does self-revision actually improve reasoning in language models?" argues, the self-critiquing failure in planning is the same mechanism operating on a different task type.
The proposed architecture is the LLM-Modulo framework: a generate-test-critique loop where LLMs generate candidate plans and a bank of external critics evaluates them. LLMs play multiple roles — guessing candidates, translating formats, helping users flesh out specifications, helping experts acquire domain models — but are never ascribed planning or verification abilities. Plans produced by this compound system have formal soundness guarantees because of the external critics.
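In code, the framework reduces to a short loop. Here is a minimal sketch under assumed interfaces (generate_candidate, the Critic signature, and the iteration budget are illustrative, not the paper's API): the LLM only proposes candidates, and a plan is accepted only when every external critic signs off, so any soundness guarantee comes from the critics rather than the model.

```python
# Generate-test-critique loop in the spirit of LLM-Modulo (sketch only;
# function names and interfaces are assumptions, not the paper's API).
from typing import Callable, Optional

Critique = Optional[str]          # None means "no objection"
Critic = Callable[[str], Critique]

def llm_modulo_plan(problem: str,
                    generate_candidate: Callable[[str, list[str]], str],
                    critics: list[Critic],
                    max_iterations: int = 20) -> Optional[str]:
    """LLM proposes candidate plans; a bank of external critics tests them.
    The LLM is never trusted to verify its own output."""
    feedback: list[str] = []
    for _ in range(max_iterations):
        candidate = generate_candidate(problem, feedback)  # LLM as guesser
        critiques = [c for critic in critics
                     if (c := critic(candidate)) is not None]
        if not critiques:
            return candidate      # passed every critic: sound by construction
        feedback.extend(critiques)  # critiques are pumped back into the prompt
    return None                   # budget exhausted, no certified plan

# Critics can be hard (e.g. a formal verifier like validate_plan above,
# wrapped to return an error string) or soft (style, cost, preferences);
# only the hard critics carry the soundness guarantee.
```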
Kambhampati's framing is precise: "LLMs are amazing giant external non-veridical memories that can serve as powerful cognitive orthotics for human or machine agents, if rightly used." The "non-veridical" is key — LLMs reconstruct completions probabilistically rather than indexing and retrieving exactly. "The boon ('creativity') and bane ('hallucination') of LLMs is that n-gram models will naturally mix and match."
As "Can language models understand without actually executing correctly?" frames it, the planning finding is a specific instance: LLMs comprehend planning domains (extracting valid action descriptions, preconditions, and effects) without being competent to execute plans (sequencing actions that handle interactions and constraints). And as "Why do language models fail to act on their own reasoning?" documents, planning adds a third data point to the knowing-doing gap: 87% correct rationales in sequential decisions, 64% correct actions, and now 12% executable plans — the gap widens as task complexity increases.
Sources: "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks"; "Can Large Language Models Really Improve by Self-critiquing Their Own Plans?"; "Can Large Language Models Reason and Plan?"
Related concepts in this collection
- Can language models understand without actually executing correctly? Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation. Relation: planning is the paradigmatic case, comprehension of the domain without competence to execute.
- Why do language models fail to act on their own reasoning? LLMs generate correct step-by-step reasoning 87% of the time but only follow through with matching actions 64% of the time. What drives this gap between knowing and doing? Relation: planning extends the gap, 87% → 64% → 12% as complexity increases.
- Does self-revision actually improve reasoning in language models? When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability. Relation: self-critiquing failure generalizes from reasoning to planning.
- Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift? Relation: planning heuristics without world models explains why knowledge extraction works but plan assembly fails.
Original note title: LLMs confuse planning knowledge for executable plans — only 12 percent of GPT-4 generated plans are executable and self-critiquing worsens performance