Why do LLMs accept logical fallacies more than humans?
LLMs fall for persuasive but invalid arguments at much higher rates than humans do. This note explores whether reasoning models genuinely evaluate logic or simply mimic the structure of argument analysis.
The LOGICOM benchmark tests a specific capability that most LLM evaluations ignore: resistance to invalid arguments that are persuasively delivered. The finding is striking: LLMs are 41% more likely than human participants to accept weakly delivered logical fallacies, and 69% more likely to accept strongly delivered ones. Reasoning-optimized models (o1, R1) show no meaningful advantage over standard models.
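For concreteness, these percentages read as relative increases over the human acceptance rate. A minimal sketch of the arithmetic, using hypothetical baseline rates chosen only to reproduce the headline gaps (they are not LOGICOM's actual numbers):

```python
# Minimal sketch: what the headline numbers mean as relative rates.
# The baseline acceptance rates below are illustrative, not LOGICOM's data.

def relative_increase(llm_rate: float, human_rate: float) -> float:
    """How much more often the LLM accepts fallacies than humans do."""
    return (llm_rate - human_rate) / human_rate

# Hypothetical rates consistent with the reported 41% / 69% gaps:
human_rate = 0.20
llm_weak, llm_strong = 0.282, 0.338

print(f"weak delivery:   +{relative_increase(llm_weak, human_rate):.0%}")
print(f"strong delivery: +{relative_increase(llm_strong, human_rate):.0%}")
```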
What this reveals is a structural problem, not a surface one. LLMs are trained to be responsive to the rhetorical features of language — fluency, confidence, elaboration — because these features correlate with quality in the training distribution. But this correlation breaks under adversarial conditions. A confident, well-elaborated fallacy triggers the same responsiveness signals as a confident, well-elaborated valid argument. The model has no internal fallacy detector that operates independently of rhetorical quality.
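One way to probe this mechanism is a 2x2 design that crosses logical validity with rhetorical delivery and checks which factor the model's verdict tracks. Everything in the sketch below (`query_model`, the prompt wording, the argument texts) is an illustrative assumption, not LOGICOM's actual protocol:

```python
# Sketch of a 2x2 probe crossing logical validity with rhetorical delivery.
# query_model() is a hypothetical stand-in for a real LLM API call.

def query_model(prompt: str) -> str:
    raise NotImplementedError("wire up an actual LLM client here")

# Argument texts are illustrative; only the (validity, delivery) cells matter.
ARGUMENTS = {
    (True,  "plain"):     "All whales are mammals. Moby is a whale. "
                          "Therefore Moby is a mammal.",
    (True,  "confident"): "It is beyond dispute that all whales are mammals, "
                          "and since Moby is a whale, Moby is plainly a mammal.",
    (False, "plain"):     "Some mammals fly. Moby is a mammal. "
                          "Therefore Moby flies.",
    (False, "confident"): "The conclusion is inescapable: some mammals fly, "
                          "and Moby, being a mammal, must therefore fly.",
}

def run_probe() -> None:
    for (valid, delivery), text in ARGUMENTS.items():
        verdict = query_model(f"Is this argument logically valid? {text}")
        # A logic-tracking model: verdict varies only with `valid`.
        # A rhetoric-tracking model: verdict shifts with `delivery` too.
        print(f"valid={valid} delivery={delivery}: {verdict}")
```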
This is different from the hallucination problem. Hallucinations involve generating false content from within. Fallacy susceptibility involves accepting false content from without. The failure mode is about input validation under persuasive framing, not output generation.
The finding also complicates the reasoning-model narrative. If chain-of-thought were doing genuine logical evaluation, reasoning models should be more resistant, since they explicitly work through the argument structure. That they are not suggests CoT mimics the surface form of argument analysis without performing its function. "Do language models actually use their reasoning steps?" provides the mechanism: CoT steps may be causally sufficient to generate the answer but not causally necessary to the reasoning process.
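That claim is testable with a causal intervention: corrupt individual CoT steps and check whether the final answer moves. A minimal sketch, assuming a hypothetical `generate()` call for the underlying model; the corruption used here (swapping in an irrelevant sentence) is just one illustrative strategy:

```python
import random

def generate(prompt: str) -> str:
    """Hypothetical LLM call: returns the model's continuation."""
    raise NotImplementedError("wire up an actual LLM client here")

def answer_with_chain(question: str, steps: list[str]) -> str:
    """Ask for a final answer conditioned on a given reasoning chain."""
    chain = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return generate(f"{question}\n{chain}\nFinal answer:")

def necessity_score(question: str, steps: list[str], trials: int = 20) -> float:
    """Fraction of corruptions that change the final answer.

    A score near 0 means the steps are decorative: the answer survives
    even when the stated reasoning is broken mid-chain.
    """
    baseline = answer_with_chain(question, steps)
    changed = 0
    for _ in range(trials):
        i = random.randrange(len(steps))
        corrupted = steps.copy()
        corrupted[i] = "The moon is made of basalt."  # irrelevant filler
        if answer_with_chain(question, corrupted) != baseline:
            changed += 1
    return changed / trials
```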
The implication for deployment: LLMs used in debate, argumentation, or adversarial contexts — legal AI, negotiation support, policy analysis — inherit this susceptibility. Any system that can be prompted with persuasive text is a system that can be convinced of invalid conclusions through rhetorical quality alone.
LogicBench extends this to systematic evaluation across logical reasoning types. LLMs struggle specifically with instances involving complex reasoning, negations, and non-monotonic reasoning. The non-monotonic finding is particularly revealing: formalizing "normally," "typically," and "usually" (concepts that allow exceptions to general rules) lies beyond the reach of classical first-order quantifiers. LLMs must handle default reasoning, reasoning about unknown expectations, and reasoning about priorities, all of which require recognizing and processing exceptions. This connects to "Why do reasoning models fail at exception-based rule inference?": exception handling is a shared failure point across both adversarial robustness and logical reasoning evaluations. NLSat additionally shows that transformers can be surprisingly robust on hard propositional satisfiability instances given sufficient training, suggesting the bottleneck is not raw computational capacity but the handling of negation, exceptions, and non-standard logical connectives.
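To make the non-monotonic point concrete: "birds normally fly" cannot be rendered as the classical universal ∀x(Bird(x) → Flies(x)) without contradicting the penguin case; the rule has to be retractable when an exception appears. A minimal sketch of a defeasible rule, with the encoding purely illustrative:

```python
# Minimal sketch of default reasoning with exceptions. The classical rule
# "for all x, Bird(x) implies Flies(x)" is contradicted by penguins; a
# defeasible rule instead applies unless an exception is present.
from dataclasses import dataclass, field

@dataclass
class DefaultRule:
    premise: str                                        # e.g. "bird"
    conclusion: str                                     # e.g. "flies"
    exceptions: set[str] = field(default_factory=set)   # e.g. {"penguin"}

    def applies(self, facts: set[str]) -> bool:
        """Conclude by default, unless a known exception is among the facts."""
        return self.premise in facts and not (facts & self.exceptions)

birds_fly = DefaultRule("bird", "flies", exceptions={"penguin", "ostrich"})

print(birds_fly.applies({"bird"}))             # True: the default holds
print(birds_fly.applies({"bird", "penguin"}))  # False: exception defeats it
# Non-monotonicity: learning the additional fact "penguin" *retracts* the
# earlier conclusion, which no classical (monotonic) logic permits.
```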
Source: Argumentation
Related concepts in this collection
- Does fine-tuning on NLI teach inference or amplify shortcuts? When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities. Connection: the same structural pattern, where surface features override inference principles.
- Do LLMs predict entailment based on what they memorized? Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis. Connection: attestation bias is the input side, and fallacy susceptibility the output side, of the same failure.
- Do language models actually use their encoded knowledge? Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing. Connection: logical validity may be encoded yet fail to causally influence acceptance decisions.
- Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations. Connection: explains why reasoning models gain no resistance advantage.
- Can LLM judges be fooled by fake credentials and formatting? Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge. Connection: extends the adversarial surface; fallacy susceptibility attacks the model-as-reasoner, judge bias attacks the model-as-evaluator, and both are presentation-layer vulnerabilities where delivery overrides content evaluation.
- Why do reasoning models fail at exception-based rule inference? Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns. Connection: exception handling is a shared failure point across adversarial robustness and logical reasoning.
- Do large language models reason symbolically or semantically? Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference. Connection: non-monotonic reasoning failures trace to the same semantic dependency; "normally" requires context-sensitive defaults that semantic associations cannot reliably provide.
Original note title: LLMs are susceptible to logical fallacies 41 to 69 percent more often than humans, revealing that reasoning robustness fails under adversarial framing.