Language Understanding and Pragmatics

Why do language models struggle with questions containing false assumptions?

Do LLMs reliably detect and reject questions built on false premises? The (QA)² benchmark tests this directly, measuring whether models can identify problematic assumptions embedded in naturally plausible questions.

Note · 2026-02-21 · sourced from Natural Language Inference
Where exactly does language competence break down in LLMs? How should researchers navigate LLM reasoning research?

The (QA)² benchmark (Question Answering with Questionable Assumptions) evaluates models on naturally occurring search-engine queries, questions that may or may not contain false or unverifiable assumptions. In zero-shot settings, models scored roughly half as well on questions with questionable assumptions as on valid ones. The best configuration (text-davinci-003 with in-context demonstrations) reached 56% human-judged acceptability end-to-end.
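
To make the scoring concrete, here is a minimal Python sketch of the split that produces those numbers: acceptability computed separately for valid questions and for questions with questionable assumptions. The item fields and function names are assumptions for illustration, not the benchmark's released code.

    from dataclasses import dataclass

    @dataclass
    class QAItem:
        question: str
        questionable: bool  # gold label: does the question rest on a false or unverifiable assumption?
        acceptable: bool    # human judgment of the model's response

    def split_acceptability(items):
        # Bucket human judgments by the gold assumption label, then
        # average each bucket, mirroring the (QA)² valid/questionable split.
        buckets = {True: [], False: []}
        for item in items:
            buckets[item.questionable].append(item.acceptable)
        return {
            "questionable": sum(buckets[True]) / max(len(buckets[True]), 1),
            "valid": sum(buckets[False]) / max(len(buckets[False]), 1),
        }

    # Toy items; real (QA)² items are naturally occurring search queries.
    items = [
        QAItem("when did marie curie discover uranium", True, False),
        QAItem("when did marie curie discover radium", False, True),
    ]
    print(split_acceptability(items))  # {'questionable': 0.0, 'valid': 1.0}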

The key challenge: questions with false assumptions "in the wild often do not stand out as bad questions." A question like "When did Marie Curie discover uranium?" requires topical expertise to detect the false assumption (Curie discovered radium and polonium; uranium was identified long before her work). In contrast, artificial examples ("Which linguist invented the lightbulb?") flag themselves immediately. Real questionable assumptions are embedded in naturally plausible-sounding questions.

Detection subtasks: binary detection of questionable assumptions (64% accuracy) and assumption verification (72%) both scored higher than end-to-end QA (56%), suggesting that even when a model identifies the false assumption, generating an appropriate response remains difficult. An acceptable response must detect the false presupposition, signal its falsity, correct it where possible, and then answer the actual question or explain why it cannot be answered (see the sketch below).
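
The sketch below makes that decomposition concrete as a staged pipeline. It is illustrative only: (QA)² evaluated models directly rather than through this pipeline, query_model is a hypothetical stand-in for any LLM call, and the prompt wording is invented.

    def query_model(prompt: str) -> str:
        # Hypothetical LLM call; plug in a real client here.
        raise NotImplementedError

    def answer_with_assumption_check(question: str) -> str:
        # Step 1: binary detection, the subtask reported at ~64% accuracy.
        verdict = query_model(
            "Does this question contain a false or unverifiable assumption? "
            f"Answer yes or no.\nQuestion: {question}"
        )
        if verdict.strip().lower().startswith("no"):
            return query_model(f"Answer the question.\nQuestion: {question}")

        # Step 2: assumption verification, the subtask reported at ~72%
        # accuracy: surface the presupposition and check it.
        assumption = query_model(
            "State the assumption this question makes and whether it is true.\n"
            f"Question: {question}"
        )

        # Step 3: end-to-end response, the hardest step (56% human-judged
        # acceptability): signal the false premise, correct it if possible,
        # then answer what can be answered or explain why nothing can.
        return query_model(
            f"The question assumes: {assumption}\n"
            "Point out the false assumption, correct it if possible, and "
            f"answer what can be answered.\nQuestion: {question}"
        )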

This quantifies the performance gap that "Why do language models accept false assumptions they know are wrong?" identifies qualitatively. The ~50% performance drop is measurable, systematic, and not solved by scale: the text-davinci series improved dramatically over earlier models, yet the gap persists.


