Why do LLMs fail at simple deductive reasoning?
LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?
The "Minds vs. Machines" entailment benchmark reveals a non-obvious asymmetry: LLMs outperform humans on multi-hop reasoning tasks that require integrating information across multiple sentences and knowledge types, while humans outperform LLMs on tasks requiring simple deductive inference.
This reverses the common intuition: we expect LLMs to handle simple cases well and fail on complex ones. Instead, the more complex the multi-hop reasoning, the larger the LLM advantage over humans becomes. Simple deductive steps, the kind of inference humans find trivially obvious, are precisely where LLMs are weakest.
The knowledge type taxonomy matters: entity-grounded knowledge (facts about entities, verifiable externally), commonsense knowledge (implicit everyday reasoning, hard to articulate), and localized knowledge (context-specific, impossible to infer unless stated). LLMs handle entity-grounded reasoning better; humans handle commonsense inferences better.
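A minimal sketch of how this asymmetry could be measured, assuming a hypothetical set of labelled entailment items annotated with an inference-step count and one of the three knowledge types (the field names and schema below are illustrative, not the benchmark's actual format):

```python
from collections import defaultdict

# Hypothetical entailment items: `hops` counts inference steps,
# `knowledge_type` is one of "entity", "commonsense", "localized".
examples = [
    {"premise": "...", "hypothesis": "...", "gold": "entailment",
     "hops": 3, "knowledge_type": "entity"},
    # ... more items ...
]

def accuracy_by(examples, predictions, key):
    """Accuracy grouped by a metadata field (hop count or knowledge type)."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        total[ex[key]] += 1
        correct[ex[key]] += int(pred == ex["gold"])
    return {group: correct[group] / total[group] for group in total}

# Comparing the two profiles side by side is what surfaces the inversion:
# accuracy_by(examples, llm_predictions, "hops")   vs.
# accuracy_by(examples, human_predictions, "hops")
```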
This connects to the inversion captured in Does LLM grammatical performance decline with structural complexity?, but the failure mode is different. Grammatical complexity degrades LLM performance; inferential complexity does not necessarily degrade it, and can even widen the LLM advantage over humans, who tire or miss steps in long inference chains. Structural complexity and inferential complexity have different difficulty profiles.
The practical implication: the right use case for LLM-assisted reasoning is complex multi-step inference that humans find cognitively taxing, not simple first-order deductions that humans find trivial. And Why do embedding contexts confuse LLM entailment predictions? identifies the specific class of simple inference where LLMs fail worst: trivial entailments that humans resolve effortlessly.
Source: Natural Language Inference
Related concepts in this collection
- Does LLM grammatical performance decline with structural complexity? This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable. (Structural complexity degrades LLMs; inferential multi-hop complexity does not follow the same pattern.)
- Can non-reasoning models catch up with more compute? Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this. (Complementary: reasoning capability is training-regime specific; this note observes that the regime shapes which *type* of task LLMs are better at than humans.)
- Why do embedding contexts confuse LLM entailment predictions? Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs. (The specific trivial inferences where LLMs fail.)
Original note title: llms outperform humans at multi-hop reasoning in extended contexts but fail at simple deductive inference