(QA)2: Question Answering with Questionable Assumptions

Paper · arXiv 2212.10003 · Published December 20, 2022
Natural Language Inference · Argumentation · Linguistics, NLP, NLU

For instance, the question "When did Marie Curie discover Uranium?" cannot be answered as a typical when-question without addressing the false assumption that Marie Curie discovered Uranium. In this work, we propose (QA)2 (Question Answering with Questionable Assumptions), an open-domain evaluation dataset consisting of naturally occurring search engine queries that may or may not contain questionable assumptions. To be successful on (QA)2, systems must be able to detect questionable assumptions and also produce adequate responses for both typical information-seeking questions and ones with questionable assumptions. Through human rater acceptability on end-to-end QA with (QA)2, we find that current models struggle with handling questionable assumptions, with the best-performing models at 56% human-judged abstractive QA acceptability, and 64% and 72% on the binary classification subtasks of questionable assumption detection and verification, respectively.
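As a rough illustration of how a binary subtask such as questionable assumption detection could be scored, the following is a minimal sketch, not the evaluation code used for (QA)2. The `Example` class, its field names, the `always_valid` baseline, and the third toy question are all hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Example:
    """One toy item; the field names are assumptions, not the dataset schema."""
    question: str
    has_questionable_assumption: bool  # gold label


def detection_accuracy(
    examples: Sequence[Example],
    predict: Callable[[str], bool],
) -> float:
    """Fraction of examples where the predicted label matches the gold label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        predict(ex.question) == ex.has_questionable_assumption
        for ex in examples
    )
    return correct / len(examples)


# Toy data: the first two questions appear in the text, the third is made up.
examples = [
    Example("When did Marie Curie discover Uranium?", True),
    Example("Which linguist invented the lightbulb?", True),
    Example("When did Marie Curie win the Nobel Prize in Physics?", False),
]


def always_valid(question: str) -> bool:
    """Trivial baseline that never flags a questionable assumption."""
    return False


baseline_accuracy = detection_accuracy(examples, always_valid)
```

On this toy set the trivial baseline is correct only on the one valid question, so any useful detector has to beat the majority-class rate of the data it is evaluated on.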

Presuppositions are backgrounded meanings associated with a linguistic utterance, as opposed to directly asserted content of the utterance. Information that is presupposed is taken for granted by all discourse participants, and this presupposed content must be true in order for the utterance to be appropriate.

Questionable Assumptions. We define questionable assumptions of a question to be false or unverifiable assumptions that are likely to be believed by the asker. Assumptions under this definition relate to the epistemic bias of the speaker (Romero and Han, 2004; Eilam and Lai, 2009), and such epistemically biased propositions associated with a question may not be truly presupposed. For example, "How many great white sharks are in captivity?" does not presuppose that there exist great white sharks in captivity, but it is reasonably likely that this question was asked because the speaker believed that there are in fact great white sharks in captivity. Hence, assumptions of a question in this paper encompass both genuine presuppositions and epistemically biased propositions.

We group presuppositions and epistemically biased propositions together for the following two reasons. First, distinguishing genuine presuppositions from epistemically biased propositions is empirically challenging. It is generally agreed that not all propositions associated with wh-questions are presuppositions.

Detection of Failure Requires Topical Expertise. We observe that questions with questionable assumptions in the wild often do not immediately stand out as bad questions, unlike examples such as "Which linguist invented the lightbulb?"

Types of Questionable Assumptions. Further analysis reveals that most of the questionable assumptions are associated with either the wh-word (77%; who

Overall, we found that end-to-end QA is challenging, with the best model (text-davinci-003 with in-context demonstrations) at 56% human-judged acceptability. Nevertheless, the text-davinci series appears to improve dramatically on the capacities of existing models: zero-shot 002/003 models achieved an improvement of about 20 percentage points over the best non-text-davinci model (davinci, 28%).

In the zero-shot setting, models in general substantially underperformed at answering questions with questionable assumptions compared to answering valid questions, with most models only achieving half the performance when questionable assumptions were present.
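The comparison above amounts to computing human-judged acceptability separately for questions with and without questionable assumptions, then comparing the two rates. The sketch below shows one way this could be done; the function name, the input format, and the toy judgments are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def acceptability_by_group(
    judgments: Iterable[Tuple[bool, bool]],
) -> Dict[bool, float]:
    """Acceptability rate per question type.

    Each judgment is (has_questionable_assumption, rated_acceptable).
    Returns a mapping from the first flag to the fraction rated acceptable.
    """
    totals: Dict[bool, int] = defaultdict(int)
    accepted: Dict[bool, int] = defaultdict(int)
    for questionable, acceptable in judgments:
        totals[questionable] += 1
        accepted[questionable] += int(acceptable)
    return {group: accepted[group] / totals[group] for group in totals}


# Made-up rater judgments for illustration only.
toy_judgments = [
    (True, True), (True, False), (True, False), (True, False),
    (False, True), (False, True), (False, False), (False, False),
]

rates = acceptability_by_group(toy_judgments)
# Ratio of acceptability on questionable-assumption questions to valid ones.
relative_performance = rates[True] / rates[False]
```

In this toy data the ratio is one half, which mirrors the pattern described above where performance roughly halves when questionable assumptions are present.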