What cognitive capacities do LLMs actually lack that commentary assumes they have?
This explores the specific cognitive capacities that public discussion takes for granted in LLMs — genuine understanding, knowing, reasoning, and reading minds — but that the research corpus shows are systematically absent or dissociated from the surface fluency people mistake for them.
Put another way: when people talk about LLMs as if they 'understand' or 'know' things, which of those assumed capacities does the research actually find missing? The corpus points to a recurring pattern: models are fluent at the *output* of a capacity while lacking the *machinery* that commentary assumes produces it. The sharpest version is what one line calls a computational split-brain: models state a correct principle with 87% accuracy but act on it correctly only 64% of the time, so the gap is not missing knowledge but a structural disconnect between explanation and execution (see: Can language models understand without actually executing correctly?). A related line names this 'Potemkin understanding': a model explains a concept correctly, fails to apply it, *and* can recognize its own failure, a triple pattern that coherent human understanding would not produce (see: Can LLMs understand concepts they cannot apply?).
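The explain-versus-apply gap described above can be made concrete with a small calculation. The sketch below is illustrative only: the `Trial` record, the `dissociation_summary` helper, and the toy data are assumptions introduced here, not the scoring procedure of the cited studies. It shows that the headline gap (explanation accuracy minus application accuracy) and the share of correctly explained items that are nevertheless misapplied are two different numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One test item (hypothetical record layout, assumed for illustration)."""
    explained_ok: bool  # the model stated the principle correctly
    applied_ok: bool    # the model acted on the principle correctly


def dissociation_summary(trials):
    """Summarize how far explanation and application come apart."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    explain_acc = sum(t.explained_ok for t in trials) / n
    apply_acc = sum(t.applied_ok for t in trials) / n
    explained = [t for t in trials if t.explained_ok]
    # Share of correctly explained items that were still applied incorrectly.
    misapplied = (
        sum(not t.applied_ok for t in explained) / len(explained)
        if explained
        else 0.0
    )
    return {
        "explain_acc": explain_acc,
        "apply_acc": apply_acc,
        "gap": explain_acc - apply_acc,
        "explained_but_misapplied": misapplied,
    }


# Toy data, invented for this example (not results from any study).
trials = (
    [Trial(True, True)] * 6
    + [Trial(True, False)] * 2
    + [Trial(False, True)] * 1
    + [Trial(False, False)] * 1
)
summary = dissociation_summary(trials)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

Reporting the conditional rate alongside the raw gap matters because a model can apply a principle correctly on items where it failed to explain it, which narrows the raw gap without reducing the number of explained-but-misapplied items.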
The capacity most assumed and most absent is genuine knowing. Models track statistical regularities at high fidelity but show structurally specific failures (hallucination, reasoning collapse, sensitivity to how a premise is phrased) that mark the measurable gap between tracking patterns and actually knowing something (see: What do language models actually know?). Pragmatic competence is another assumed capacity that turns out to be hollow: LLMs pattern-match on what is said but cannot reliably reason about what is left unsaid, such as implicature, presupposition, and speaker intent, scoring 32% on ambiguity recognition where humans reach 90% (see: Why do LLMs fail at understanding what remains unsaid?). And theory of mind, perhaps the most anthropomorphized capacity of all, splits the same way: GPT-4.5 reaches the 100th percentile at predicting social norms, yet o1 and Claude 3.7 regress on theory-of-mind tasks such as Decrypto, and surface strategies collapse once scenarios become open-ended (see: Why do LLMs excel at social norms yet fail at theory of mind?).
What makes this more than a list of deficits is that persuasion, the most socially consequential capacity, is *dissociable* from comprehension. Models sway debate participants and audiences while being unable to reliably evaluate those same debates, which means influence and understanding are separable, and fluency in one says nothing about the other (see: Can LLMs persuade without actually understanding arguments?). This is the through-line: commentary assumes these capacities travel together because in humans they do. In LLMs they come apart.
The corpus also pushes back on lazy versions of this critique. 'Real reasoning vs. pattern matching' turns out to be a bad axis: humans and LLMs fail and succeed along the *same* content-sensitivity curve on classic reasoning tests such as Wason tasks and natural language inference, so content-independence is not a meaningful line between machine and mind (see: Do language models fail reasoning tests that humans pass?). And capability is not uniformly worse: LLMs outperform humans at integrating information across multiple sentences while losing to them on straightforward logical inference, so the deficit is about the *kind* of capability, not its complexity level (see: Why do LLMs fail at simple deductive reasoning?).
The deepest answer reframes the whole question. One line argues the missing ingredient is not a skill at all but *participatory subjectivity*: LLMs are shaped by the same shared symbolic system as humans, but only humans develop reflexive agency through being socialized into it. That absence shows up concretely: AI argues without ever declaring its own position or reflecting on its assumptions (see: Do LLMs develop the same kind of mind as humans?). For a method that tells assumed capacity from actual capacity rather than just cataloguing failures, the corpus offers Marr's three levels of analysis (computational, algorithmic, implementation) as a structured way to ask what a model is computing, how, and in what substrate (see: Can cognitive science methods unlock how LLMs actually work?). There is also a hopeful counterpoint: some 'missing' reasoning is latent and merely un-elicited, recoverable by structuring the model's own calls rather than retraining it (see: Can modular cognitive tools unlock reasoning without training?).
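The idea of "structuring the model's own calls" can be sketched as a set of small tools, each of which runs as its own isolated model call. Everything below is a hypothetical illustration under stated assumptions: the tool names and prompt templates are invented, `llm` is a stand-in for whatever function sends a prompt to a model and returns text, and the fixed tool order is a simplification. It is not the implementation used in the cited work.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: an LLM is modeled as a plain "prompt in, text out" function.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class CognitiveTool:
    """A single-purpose step run as its own model call (hypothetical design)."""
    name: str
    template: str  # must contain the placeholder "{input}"

    def run(self, llm: LLM, text: str) -> str:
        # Isolation: the call sees only this tool's template and input,
        # never the outputs or prompts of the other tools.
        return llm(self.template.format(input=text))


# Four illustrative tools; names and wording are invented for this sketch.
TOOLS: List[CognitiveTool] = [
    CognitiveTool("restate", "Restate the problem and list its givens:\n{input}"),
    CognitiveTool("recall", "Describe a similar solved problem for:\n{input}"),
    CognitiveTool("plan", "Outline the solution steps for:\n{input}"),
    CognitiveTool("pitfalls", "List likely mistakes when solving:\n{input}"),
]


def solve(llm: LLM, problem: str, tools: List[CognitiveTool]) -> str:
    """Run each tool in isolation, then make one final call that sees all notes."""
    notes = [f"[{tool.name}] {tool.run(llm, problem)}" for tool in tools]
    final_prompt = (
        problem + "\n\nTool notes:\n" + "\n".join(notes) + "\n\nFinal answer:"
    )
    return llm(final_prompt)


def make_recording_llm() -> Tuple[LLM, List[str]]:
    """Fake model for demonstration: records prompts, returns numbered replies."""
    calls: List[str] = []

    def llm(prompt: str) -> str:
        calls.append(prompt)
        return f"reply-{len(calls)}"

    return llm, calls


llm, calls = make_recording_llm()
answer = solve(llm, "If 3x + 1 = 10, what is x?", TOOLS)
print(len(calls), answer)
```

The design point is that isolation is enforced by the code path rather than requested in a prompt: each tool call is built from its own template, so it cannot be influenced by the other tools' outputs, and only the final call combines them.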
Sources (11 notes)
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.
Research shows LLMs pattern-match on explicit language but cannot reason about implicatures, presuppositions, or speaker intentions. They fail at scalar implicature adaptation, ambiguity recognition (32% vs 90% human accuracy), and implicit warrant validation in arguments—core features of pragmatic competence.
GPT-4.5 reaches the 100th percentile on social norm prediction, yet o1 and Claude 3.7 regress on theory of mind tasks like Decrypto. Open-ended scenarios expose surface-level strategies hidden by structured questions, and reasoning effort does not improve social reasoning performance.
The Thin Line study shows LLMs sway debate participants and audiences but cannot reliably evaluate those same debates, with inter-annotator agreement ranging from near-zero to 0.6. Persuasive competence and pragmatic comprehension are separable capabilities.
Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.
The Minds vs. Machines benchmark shows LLMs excel at integrating information across multiple sentences while humans outperform them on straightforward logical inference. Capability type, not complexity level, determines who performs better.
Both humans and LLMs are shaped by the same intersubjective symbolic system, but only humans develop reflexive agency through socialization. This absence produces measurable differences in how AI argues without declaring its position or reflecting on its own assumptions.
Cognitive science's 70-year toolkit of behavioral probes, causal interventions, and representational analysis transfers directly to LLM interpretation. Marr's computational, algorithmic, and implementation levels reframe the problem structurally and enable layered rather than monolithic explanation.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.