Why do language models struggle with questions containing false assumptions?
Do LLMs reliably detect and reject questions built on false premises? The (QA)2 benchmark tests this directly, measuring whether models can identify problematic assumptions embedded in naturally plausible questions.
The (QA)2 benchmark (Question Answering with Questionable Assumptions) evaluates models on naturally occurring search engine queries: questions that may or may not contain false or unverifiable assumptions. In zero-shot settings, models achieved roughly half the performance on questions with questionable assumptions that they reached on valid questions. The best model (text-davinci-003 with in-context demonstrations) reached 56% human-judged acceptability end-to-end.
The key challenge: questions with false assumptions "in the wild often do not stand out as bad questions." A question like "When did Marie Curie discover uranium?" requires topical expertise to detect the false assumption (Curie discovered radium and polonium; uranium was known long before her). In contrast, artificial examples ("Which linguist invented the lightbulb?") flag themselves immediately. Real questionable assumptions are embedded in naturally plausible-sounding questions.
On the detection subtasks, binary detection of questionable assumptions (64% accuracy) and assumption verification (72%) both scored higher than end-to-end QA (56%), suggesting that even when models can identify the false assumption, generating an appropriate response remains difficult. An adequate response must simultaneously detect the false presupposition, signal its falsity, correct it if possible, and then answer the actual question or explain why it cannot be answered.
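To make the difference between these framings concrete, here is a minimal Python sketch of the three subtasks as prompting problems. The data schema, example items, prompt wording, and trivial baseline are illustrative assumptions, not the benchmark's actual format; in particular, the paper's end-to-end answers were scored by human acceptability judgments, not an automatic metric.

```python
from dataclasses import dataclass

@dataclass
class QA2Item:
    question: str           # naturally occurring query
    assumption: str         # the presupposition the query carries
    assumption_holds: bool  # gold label: does the assumption hold?

# Hypothetical items in the spirit of the benchmark, not actual (QA)2 data.
ITEMS = [
    QA2Item("When did Marie Curie discover uranium?",
            "Marie Curie discovered uranium.", False),
    QA2Item("When did Marie Curie discover radium?",
            "Marie Curie discovered radium.", True),
]

# Subtask 1: binary detection. Given only the question, decide whether it
# rests on a questionable assumption.
def detection_prompt(item: QA2Item) -> str:
    return (f"Question: {item.question}\n"
            "Does this question contain a false or unverifiable assumption? "
            "Answer yes or no.")

# Subtask 2: assumption verification. The assumption is handed to the model
# already extracted; it only has to judge whether the claim is true.
def verification_prompt(item: QA2Item) -> str:
    return f"Claim: {item.assumption}\nIs this claim true? Answer yes or no."

# End-to-end QA: the model must notice the bad assumption on its own, say so,
# correct it if possible, and then answer or explain why it cannot.
def end_to_end_prompt(item: QA2Item) -> str:
    return ("Answer the question below. If it rests on a false assumption, "
            f"point that out instead of answering directly.\n"
            f"Question: {item.question}")

def accuracy(predict, items) -> float:
    """Score a predictor that returns True iff it thinks the assumption holds."""
    return sum(predict(it) == it.assumption_holds for it in items) / len(items)

if __name__ == "__main__":
    print(detection_prompt(ITEMS[0]))
    # Stand-in predictor that always trusts the question; swap in a real LLM
    # call over the prompts above to run the actual subtask comparison.
    always_trust = lambda item: True
    print(f"always-trust baseline accuracy: {accuracy(always_trust, ITEMS):.2f}")
```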
This quantifies the performance gap that "Why do language models accept false assumptions they know are wrong?" identifies qualitatively. The ~50% performance drop is measurable, systematic, and not solved by scale: the text-davinci series improved dramatically over previous models, yet the gap persists.
Source: Natural Language Inference
Related concepts in this collection
- Why do language models accept false assumptions they know are wrong?
  Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.
  Relation: same failure domain; (QA)2 provides the performance quantification.
- Why do speakers need to actively calibrate shared reference?
  Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.
  Relation: handling questionable assumptions requires exactly this kind of calibration, detecting when the questioner's presuppositions diverge from fact.
- Why are presuppositions more persuasive than direct assertions?
  Explores why presenting information as shared background rather than as a claim makes it more persuasive to audiences. This matters because it reveals how language structure itself can bypass critical evaluation.
  Relation: false presuppositions embedded in plausible-sounding questions are especially difficult to detect because they carry the persuasive force of backgrounded claims.
Original note title: LLMs underperform by approximately 50% on questions with false assumptions compared to valid questions