Can language models recognize when text is deliberately ambiguous?
Explores whether LLMs can identify and handle multiple valid interpretations of a single phrase, a core human language skill that appears largely absent in current models despite their fluency on standard tasks.
AMBIENT (Liu et al. 2023) is the first benchmark to evaluate pretrained LMs specifically on ambiguity recognition and disambiguation. It contains 1,645 linguist-annotated examples covering diverse ambiguity types: lexical ambiguity, structural ambiguity, scope ambiguity, and others.
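To make the task concrete, here is an illustrative record in the spirit of AMBIENT's NLI setup (the field names and this specific example are assumptions for illustration, not copied from the dataset): an ambiguous premise can carry one NLI label per valid reading, alongside disambiguating rewrites.

```python
# Illustrative AMBIENT-style record (hypothetical example; field names assumed).
example = {
    "premise": "I saw the man with the telescope.",  # structurally ambiguous
    "hypothesis": "I used a telescope.",
    "labels": ["entailment", "neutral"],             # one label per reading
    "disambiguations": [
        ("I saw the man by using the telescope.", "entailment"),
        ("I saw the man who was holding the telescope.", "neutral"),
    ],
}
```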
The findings are stark:
- GPT-4's generated disambiguations were rated correct by crowdworkers only 32% of the time
- Human reference disambiguations were rated correct 90% of the time
- The best finetuned multilabel NLI model predicts the exact label set for ambiguous instances in only 43.6% of cases
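To make the 43.6% metric concrete, here is a minimal sketch (an assumption, not the paper's evaluation code) of exact-match accuracy over label sets: an example counts as correct only if the predicted set of NLI labels matches the gold set exactly, one label per valid reading.

```python
# Sketch of exact-match label-set accuracy for multilabel NLI (assumed metric code).
def exact_match_accuracy(gold_sets, pred_sets):
    """gold_sets, pred_sets: lists of sets of NLI labels, one set per example."""
    hits = sum(1 for gold, pred in zip(gold_sets, pred_sets) if gold == pred)
    return hits / len(gold_sets)

# An ambiguous premise can license two labels at once, one per reading:
gold = [{"entailment", "neutral"}, {"contradiction"}]
pred = [{"entailment"}, {"contradiction"}]  # misses the second reading of example 1
print(exact_match_accuracy(gold, pred))  # 0.5
```

Under this metric, a model that always resolves to a single reading is penalized on every genuinely ambiguous example, which is the behavior the 43.6% figure exposes.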
Ambiguity management is central to human language understanding. As communicators, we anticipate possible misunderstandings. As listeners, we ask clarifying questions, revise interpretations based on new information, and use contextual factors to select among multiple possible readings. This capacity appears largely absent in current LLMs despite their fluency on standard benchmarks.
The task tests three distinct capabilities, and models fail at all three: generating relevant disambiguations, recognizing the set of possible interpretations, and modeling the different interpretations in their continuation distributions. The failure is not isolated to one capability but systematic across the full ambiguity-management competence.
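The third capability can be probed with a log-probability comparison, sketched below under assumed choices (GPT-2 as the model, and these example sentences and continuations); this mirrors the idea of the continuation-distribution test rather than reproducing the paper's exact setup. A model that represents both readings should assign non-trivial probability to continuations consistent with each one.

```python
# Illustrative continuation-distribution probe (assumed setup, not the paper's code):
# if an LM models both readings of an ambiguous sentence, continuations consistent
# with each reading should both receive non-trivial probability mass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probs the model assigns to `continuation` given `context`."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    cont_ids = full_ids[0, ctx_len:]
    positions = range(ctx_len - 1, full_ids.shape[1] - 1)
    return sum(log_probs[p, t].item() for p, t in zip(positions, cont_ids))

# "I saw her duck" is lexically ambiguous ('duck' as verb vs. noun):
ctx = "I saw her duck"
print(continuation_logprob(ctx, " under the table."))       # verb reading
print(continuation_logprob(ctx, " swim across the pond."))  # noun reading
```

A model that has collapsed onto one reading concentrates probability on continuations for that reading only, which is the failure mode this test surfaces.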
As "Do standard NLP benchmarks hide LLM ambiguity failures?" explains, this failure is normally invisible in standard evaluation: the 32% figure is only visible because AMBIENT was designed to include exactly what standard benchmarks exclude.
Augmented prompting can partially mitigate the failure: a systematic approach combining chain-of-thought prompting with a knowledge base of sense interpretations, part-of-speech tagging, aspect-based filtering, and few-shot examples produces "substantial improvement" on word sense disambiguation (WSD) tasks. However, the fundamental challenge persists for highly polysemous words (10+ distinct senses across noun and verb forms); current architectures remain "not confident enough" for these cases. The improvement comes from external scaffolding (the knowledge base, POS tags, examples), not from genuine semantic disambiguation competence, which reinforces the finding that LLMs handle explicit structure well but fail when multiple implicit interpretations must be managed simultaneously.
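A minimal sketch of how such a scaffolded prompt might be assembled follows; the `SENSE_KB` entries, the POS filtering standing in for aspect-based filtering, and the prompt wording are all illustrative assumptions, not taken from the source.

```python
# Hypothetical augmented WSD prompt builder: few-shot example, KB-derived sense
# glosses filtered by part of speech, and a chain-of-thought instruction.
SENSE_KB = {  # illustrative entries, not a real knowledge base
    ("bank", "NOUN"): ["sloping land beside a river", "financial institution"],
    ("bank", "VERB"): ["tilt an aircraft to one side", "rely or count on something"],
}

FEW_SHOT = (
    'Sentence: "She drew money from the bank."\n'
    "Reasoning: 'bank' is a noun here, and the money context selects the\n"
    "financial-institution sense.\n"
    "Answer: financial institution\n\n"
)

def build_wsd_prompt(sentence: str, word: str, pos: str) -> str:
    senses = SENSE_KB.get((word, pos), [])  # POS filtering narrows candidate senses
    options = "\n".join(f"- {s}" for s in senses)
    return (
        FEW_SHOT
        + f'Sentence: "{sentence}"\n'
        + f"The word '{word}' is used as a {pos.lower()}. Candidate senses:\n"
        + options
        + "\nReasoning: think step by step about the context, then choose one sense.\n"
        + "Answer:"
    )

print(build_wsd_prompt("The plane began to bank sharply.", "bank", "VERB"))
```

The scaffolding does the disambiguation work up front, retrieving and filtering candidate senses, and leaves the model a constrained multiple-choice task, which is why gains here do not imply genuine ambiguity management.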
The literary analysis framing: poetry is controlled ambiguity, a deliberate multiplicity of meaning crafted so that several readings coexist productively. A poem that resolves to a single meaning has failed as a poem. The 32% disambiguation rate means LLMs cannot even recognize the fundamental operation that makes poetry work. They cannot hold ambiguity open; they resolve it, and in resolving it, destroy it. This reframes the AMBIENT finding from a general limitation into a domain-killing one for literary work: the ability to manage ambiguity is not peripheral to literary analysis but central to it. As "Can LLMs truly understand literary meaning or just mechanics?" argues, the ambiguity failure is one of four converging mechanisms behind the mechanics-meaning gap.
Source: Linguistics, NLP, NLU; enriched from inbox/research-brief-llm-literary-analysis-2026-03-02.md
Related concepts in this collection
- Do standard NLP benchmarks hide LLM ambiguity failures? When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do? (Why this failure is normally invisible.)
- Why do speakers deliberately use ambiguous language? Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems. (What is being failed at.)
- Why do large language models fail at complex linguistic tasks? Explores whether LLMs have inherent limitations in detecting fine-grained syntactic structures, especially embedded clauses and recursive patterns, and whether these failures are systematic rather than random. (This is one of the deepest blind spots.)
- Why does ChatGPT fail at implicit discourse relations? ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships? (Ambiguity failure is another asymmetry: explicit = manageable, multiple interpretations = failure.)
Original note title
llms fail at ambiguity recognition with gpt-4 achieving 32% correct disambiguations vs 90% for humans