Why do language models fail to act on their own reasoning?
LLMs generate correct step-by-step reasoning 87% of the time but only follow through with matching actions 64% of the time. What drives this gap between knowing and doing?
Three systematic failure modes explain why LLMs perform sub-optimally in sequential decision-making: greediness (premature commitment to exploitative strategies, leaving up to 55% of the action space unexplored), frequency bias (small models copying the most frequent action in their context regardless of reward), and the knowing-doing gap (producing correct rationales but failing to act on them).
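To make the first two failure modes concrete, here is a minimal sketch of how they could be measured from an agent's action log. The 10-armed bandit, the function names, and the log are my illustration, not the paper's code:

```python
from collections import Counter

def action_coverage(chosen_actions: list[int], num_actions: int) -> float:
    """Fraction of the action space the agent ever tried; coverage of 0.45
    corresponds to the 'up to 55% unexplored' figure above."""
    return len(set(chosen_actions)) / num_actions

def copies_most_frequent(chosen_action: int, context_actions: list[int]) -> bool:
    """Frequency bias: did the agent just repeat the action that appears
    most often in its context, regardless of reward?"""
    most_frequent, _ = Counter(context_actions).most_common(1)[0]
    return chosen_action == most_frequent

# A greedy agent on a hypothetical 10-armed bandit:
log = [3, 3, 3, 1, 3, 3, 3, 7, 3, 3]
print(action_coverage(log, num_actions=10))           # 0.3 -> 70% unexplored
print(copies_most_frequent(3, context_actions=log))   # True
```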
The knowing-doing gap is the most conceptually significant finding. When LLMs generate chain-of-thought rationales about how to solve a decision-making task, 87% of the rationales are correct — yet only 64% of the subsequent actions follow the rationale's recommendation. The model knows what to do but defaults to greedy behavior instead of following its own reasoning.
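Operationally, the gap is just two rates over the same decisions. A minimal sketch, under my own simplifying assumption that each step is scored with two booleans (the paper's exact scoring protocol may differ):

```python
def knowing_doing_rates(steps: list[tuple[bool, bool]]) -> tuple[float, float]:
    """steps: one (rationale_correct, action_followed_rationale) pair per
    decision. Returns the 'knowing' rate (rationale accuracy) and the
    'doing' rate (actions that follow the rationale), both over all steps."""
    knowing = sum(r for r, _ in steps) / len(steps)
    doing = sum(a for _, a in steps) / len(steps)
    return knowing, doing

# e.g. 100 decisions: 87 correct rationales but only 64 faithful actions
steps = [(i < 87, i < 64) for i in range(100)]
print(knowing_doing_rates(steps))  # (0.87, 0.64)
```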
Scale helps only partially: larger models (27B parameters) reduce frequency bias but remain greedy. RL fine-tuning on self-generated CoT rationales mitigates all three failure modes by increasing exploration and aligning actions with rationales. This suggests the knowing-doing gap is trainable, not architectural.
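The mechanism at the core of such fine-tuning is a policy-gradient update on reward from the environment. The toy below is not the paper's RLFT setup (which fine-tunes an LLM on its self-generated CoT); it is a deliberately tiny, self-contained analogue on a softmax bandit policy, showing how reward feedback erodes a greedy prior:

```python
import numpy as np

# Toy REINFORCE on a 5-armed bandit: a policy that starts greedy on arm 0
# should shift its probability mass toward arm 3, the truly best arm.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.1, 0.9, 0.3])   # arm 3 is best
logits = np.array([3.0, 0.0, 0.0, 0.0, 0.0])       # greedy prior on arm 0
lr, baseline = 0.1, 0.0

for _ in range(2000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(5, p=probs)                      # sample an action
    r = rng.normal(true_means[a], 0.1)              # noisy reward
    baseline += 0.01 * (r - baseline)               # running reward baseline
    grad = -probs                                   # d/d_logits log pi(a) ...
    grad[a] += 1.0                                  # ... = onehot(a) - probs
    logits += lr * (r - baseline) * grad            # REINFORCE update

probs = np.exp(logits - logits.max())
print((probs / probs.sum()).round(2))  # mass should have shifted toward arm 3
```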
This connects directly to the concept of Potemkin understanding. As explored in "Can LLMs understand concepts they cannot apply?", the knowing-doing gap is a measurable instance of exactly this pattern — the model demonstrates understanding in its rationale but fails in its action selection. The quantified gap (87% vs 64%) gives the Potemkin understanding concept empirical grounding.
The deeper implication is that CoT reasoning and action selection may involve different computational pathways. As explored in "Do language models actually use their encoded knowledge?", the knowing-doing gap may reflect a disconnect where the reasoning trace is generated through one pathway while action selection draws on different (shallower, more habitual) computations.
Alice in Wonderland: the overconfidence amplifier. The "Alice in Wonderland" paper demonstrates a dramatic instance of the knowing-doing gap on trivially simple reasoning: "Alice has N brothers and M sisters. How many sisters does Alice's brother have?" The correct answer is M + 1, since Alice herself counts among her brother's sisters. Most SOTA models collapse entirely on this problem, producing incorrect answers with strong overconfidence and offering "reasoning-like explanations akin to confabulations" to justify clearly failed responses. Standard interventions (enhanced prompting, multi-step re-evaluation) fail to recover correct answers. The confabulation-like quality of the justifications directly parallels the knowing-doing gap: the model generates plausible reasoning traces that do not correspond to correct computation. Claude 3 Opus and GPT-4 are notable partial exceptions: they occasionally succeed but still fail frequently, suggesting the problem is architectural rather than model-specific.
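For reference, the puzzle's ground truth is one line of arithmetic (the function name is mine):

```python
def sisters_of_alices_brother(n_brothers: int, m_sisters: int) -> int:
    # Each of Alice's brothers has her M sisters plus Alice herself;
    # the brother count N is a pure distractor.
    return m_sisters + 1

assert sisters_of_alices_brother(3, 2) == 3  # Alice: 3 brothers, 2 sisters
```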
Source: Reinforcement Learning; enriched from Flaws
Related concepts in this collection
- Can LLMs understand concepts they cannot apply?
  Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.
  instantiates: the 87%/64% gap is a quantified example of Potemkin understanding
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  explains: action selection may bypass the reasoning trace entirely
- Does chain of thought reasoning actually explain model decisions?
  When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
  parallels: both show reasoning traces decoupled from downstream behavior
- Does RL teach reasoning or teach when to use it?
  Post-training RL gets credit for building reasoning into language models, but emerging evidence suggests base models already possess this capability. The question is whether RL creates new reasoning skills or simply teaches deployment timing.
  complicates: RL fine-tuning can narrow the knowing-doing gap, suggesting RL does teach something beyond timing
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  the knowing-doing gap (87% rationales vs 64% actions) is an empirical instance of the generation-verification gap in decision-making; RL fine-tuning narrows this gap, consistent with the formal prediction that self-improvement operates precisely where verification exceeds generation
Original note title: llms are greedy agents with a knowing-doing gap — correct rationales 87 percent but greedy actions 64 percent