We’re Afraid Language Models Aren’t Modeling Ambiguity
Ambiguity is an intrinsic feature of natural language. Managing ambiguity is a key part of human language understanding, allowing us to anticipate misunderstanding as communicators and revise our interpretations as listeners. As language models are increasingly employed as dialogue interfaces and writing aids, handling ambiguous language is critical to their success. We capture ambiguity in a sentence through its effect on entailment relations with another sentence, and collect AMBIENT, a linguist-annotated benchmark of 1,645 examples with diverse kinds of ambiguity. We design a suite of tests based on AMBIENT, presenting the first evaluation of pretrained LMs' ability to recognize ambiguity and disentangle possible meanings. We find that the task remains extremely challenging, including for GPT-4, whose generated disambiguations are considered correct only 32% of the time in crowdworker evaluation, compared to 90% for disambiguations in our dataset. Finally, to illustrate the value of ambiguity-sensitive tools, we show that a multilabel NLI model can flag political claims in the wild that are misleading due to ambiguity. We encourage the field to rediscover the importance of ambiguity for NLP.
Ambiguity seems to be an essential, indispensable element for the transfer of information from one place to another by words. — Thomas (1974), as referenced in the epilogue of Grosz (1977)
Ambiguity is an intrinsic feature of language, allowing speakers to balance efficiency and clarity in communication (Zipf, 1949; Piantadosi et al., 2012). Language understanding thus requires recognizing the presence of multiple interpretations: as communicators, we anticipate the possibility of misunderstanding; as listeners, we ask clarifying questions, disambiguate meanings on the basis of a wide range of contextual factors, and backtrack and revise our earlier interpretations as needed. Beyond unintended miscommunication, ambiguity is also an effective tool for sending covert messages, e.g., out of politeness or to mislead one’s listeners while avoiding accountability (see Figure 1).
As language models (LMs) are increasingly employed to act as dialogue agents (OpenAI, 2022; Shuster et al., 2022) or to aid human communication as writing aids (Lee et al., 2022), being able to work with ambiguous language will make them more effective. This skill would support adaptation to different contexts, clearer communication, and identification of misleading or deceptive language. Yet, the ability of pretrained LMs to recognize ambiguity and disentangle possible meanings remains unstudied, partly because ambiguous instances are systematically excluded in the curation of benchmarks (Beigman Klebanov and Beigman, 2009).
To address this gap, we present AMBIENT, a linguist-annotated benchmark of ambiguous NLI examples. Each AMBIENT example consists of a premise–hypothesis pair assigned a set of labels (among entailment, neutral, and contradiction), along with disambiguating rewrites corresponding to each label when multiple labels are plausible (see Table 1 for examples). Examples are collected through two approaches: manual curation to target textbook ambiguities, and expert annotation of automatically generated unlabeled examples to uncover more diverse phenomena. Through analysis, we find that crowdworkers can reliably distinguish different readings of an ambiguous sentence and their impact on entailment choices; thus we can explicitly characterize the underlying reasons for uncertainty that would otherwise surface as “disagreement” (§3).
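For concreteness, the sketch below shows one way an AMBIENT-style example could be represented in code; the class name, field names, and sample values are illustrative assumptions, not the dataset's actual schema or release format.

```python
from dataclasses import dataclass, field

# Hypothetical representation of an AMBIENT-style example; names are
# illustrative, not the dataset's actual schema.
@dataclass
class AmbientExample:
    premise: str
    hypothesis: str
    # Subset of {"entailment", "neutral", "contradiction"} judged plausible.
    labels: set[str]
    # Maps each plausible label to a disambiguating rewrite of the ambiguous sentence.
    disambiguations: dict[str, str] = field(default_factory=dict)

example = AmbientExample(
    premise="I went to the bank.",
    hypothesis="I went to a financial institution.",
    labels={"entailment", "neutral"},
    disambiguations={
        "entailment": "I went to the bank (the financial institution).",
        "neutral": "I went to the bank (the edge of the river).",
    },
)
```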
We design a suite of tests based on AMBIENT to investigate the extent to which understanding of ambiguity is acquired during pretraining of large LMs (§4). These tests evaluate whether LMs can directly produce relevant disambiguations, recognize possible interpretations, and model different interpretations in their continuation distributions. We find that these tasks remain extremely challenging, including for the recent GPT-4 (OpenAI, 2023).
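As a concrete illustration of the generation-style test, the sketch below prompts an LM to rewrite an ambiguous premise once per reading; the prompt wording and the `lm_generate` interface are hypothetical stand-ins for whichever model API is under evaluation, not the paper's exact protocol.

```python
# A minimal sketch of the disambiguation-generation test. `lm_generate` is a
# hypothetical stand-in for a text-completion API, and the prompt wording is
# illustrative rather than the evaluation's exact instructions.

def disambiguation_prompt(premise: str, hypothesis: str) -> str:
    return (
        "The premise below may be ambiguous with respect to the hypothesis.\n"
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Rewrite the premise once for each possible reading, and state whether "
        "each rewrite entails, contradicts, or is neutral toward the hypothesis."
    )

def run_disambiguation_test(example, lm_generate) -> str:
    """Query the LM and return its raw output, to be judged for correctness
    by human annotators (e.g., crowdworkers)."""
    return lm_generate(disambiguation_prompt(example.premise, example.hypothesis))
```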
Therefore, we additionally investigate whether LMs can be finetuned on existing NLI data for the less demanding task of ambiguity recognition, without explicit disambiguation (§5). We adapt several finetuned NLI models to a multilabel setting, and find that the best model predicts the exact label set in only 43.6% of instances, suggesting that this simpler task also remains far from solved.
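One plausible way to adapt a standard three-way NLI classifier to this multilabel setting, and to score it with exact match over label sets, is sketched below; the thresholding rule and the 0.3 value are illustrative assumptions rather than the adaptation recipe actually used for these models.

```python
# A sketch of adapting a three-way NLI classifier to the multilabel setting by
# thresholding per-label probabilities instead of taking an argmax, then scoring
# with exact match over label sets. The threshold is an illustrative assumption,
# not a value tuned for any particular model.

LABELS = ("entailment", "neutral", "contradiction")

def predict_label_set(probs: dict[str, float], threshold: float = 0.3) -> set[str]:
    """Return every label whose probability clears the threshold; fall back to
    the single most probable label if none do."""
    chosen = {label for label in LABELS if probs[label] >= threshold}
    return chosen or {max(probs, key=probs.get)}

def exact_match_accuracy(predicted: list[set[str]], gold: list[set[str]]) -> float:
    """Fraction of examples whose predicted label set equals the gold label set."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```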
The simplifying assumption that text has only one interpretation has facilitated the development of large-scale benchmarks, yet limits the depth of what these benchmarks can evaluate. In this work, we show that sensitivity to ambiguity, a fundamental aspect of human language understanding, is lacking in our ever-larger models, and illustrate the value such understanding could bring.