Language Understanding and Pragmatics

Can language models recognize when text is deliberately ambiguous?

Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase—a core human language skill that appears largely absent in current models despite their fluency on standard tasks.

Note · 2026-02-21 · sourced from Linguistics, NLP, NLU
Where exactly does language competence break down in LLMs? How should researchers navigate LLM reasoning research?

AMBIENT (Liu et al. 2023) is the first evaluation of pretrained LMs specifically targeting ambiguity recognition and disambiguation. It comprises 1,645 linguist-annotated examples covering diverse ambiguity types: lexical, structural, scope, and others.
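
To make the task concrete, here is a minimal sketch of what one entry in an ambiguity benchmark can look like. The field names and the classic "I saw her duck" example are illustrative assumptions, not AMBIENT's actual schema.

```python
# Illustrative shape of one ambiguity-benchmark entry. Field names and
# the example sentence are hypothetical, not AMBIENT's actual schema.
example = {
    "ambiguous_sentence": "I saw her duck.",
    "ambiguity_type": "lexical",  # "duck" as noun vs. verb
    "interpretations": [
        "I saw the duck that belongs to her.",
        "I saw her lower her head quickly.",
    ],
}

# The task: given the ambiguous sentence, a model should surface *both*
# disambiguated rewrites rather than silently committing to one.
for reading in example["interpretations"]:
    print(f"{example['ambiguous_sentence']} -> {reading}")
```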

The findings are stark: GPT-4's disambiguations are judged correct only 32% of the time, versus 90% for human-written disambiguations.

Ambiguity management is central to human language understanding. As speakers, we anticipate possible misunderstandings; as listeners, we ask clarifying questions, revise interpretations as new information arrives, and use contextual cues to select among multiple possible readings. This capacity appears largely absent in current LLMs despite their fluency on standard benchmarks.

The task tests three distinct capabilities, and models fall short on all three: generating relevant disambiguations, recognizing possible interpretations, and reflecting different interpretations in their continuation distributions (probed in the sketch below). The failure is not isolated to one capability but systematic across the whole of ambiguity management.
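
The third capability can be probed by comparing the probability a model assigns to continuations that are compatible with only one reading each. The sketch below is an illustrative test under stated assumptions, not the paper's protocol: `continuation_logprob` is a hypothetical stand-in for a real log-probability API, and the 5% floor is an arbitrary threshold.

```python
import math

def continuation_logprob(context: str, continuation: str) -> float:
    """Hypothetical scorer for log P(continuation | context) under an LM.
    Stand-in for a real API that returns token log-probabilities."""
    raise NotImplementedError("wire up a real LM scorer here")

def keeps_both_readings(ambiguous: str, probes: dict[str, str],
                        floor: float = math.log(0.05)) -> bool:
    """True if a continuation diagnostic of *each* reading receives
    non-negligible probability mass. A model that has silently resolved
    the ambiguity will score one of them far below the floor."""
    return all(continuation_logprob(ambiguous, cont) > floor
               for cont in probes.values())

# Continuations that only make sense under one reading of "I saw her duck.":
probes = {
    "noun reading": "It was swimming in the pond.",
    "verb reading": "The ball nearly missed her head.",
}
# keeps_both_readings("I saw her duck.", probes)  # requires a real scorer
```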

As Do standard NLP benchmarks hide LLM ambiguity failures? argues, this failure is normally invisible in standard evaluation. The 32% figure is visible only because AMBIENT was designed to include what standard benchmarks exclude.

Augmented prompting can partially mitigate the failure: a systematic approach combining Chain-of-Thought prompting with a knowledge base of sense interpretations, part-of-speech (POS) tagging, aspect-based filtering, and few-shot examples produces "substantial improvement" on word sense disambiguation (WSD) tasks; a sketch of the recipe follows below. However, the fundamental challenge persists for highly polysemous words (10+ distinct senses across noun and verb forms), where current architectures remain "not confident enough". The improvement comes from external scaffolding (the knowledge base, POS tags, few-shot examples), not from genuine semantic disambiguation competence, which reinforces the finding that LLMs handle explicit structure well but fail when multiple implicit interpretations must be managed simultaneously.
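
Below is a minimal sketch of that scaffolding. The sense inventory, the few-shot example, and the prompt wording are illustrative stand-ins assumed here for concreteness; they are not the cited study's actual resources or exact recipe.

```python
# Sketch of the augmented WSD prompt: few-shot example + POS-filtered
# candidate senses from a knowledge base + a Chain-of-Thought instruction.
# SENSE_KB and FEW_SHOT are hypothetical stand-ins.

SENSE_KB = {
    ("bank", "NOUN"): [
        "a financial institution",
        "the sloping land beside a river",
    ],
}

FEW_SHOT = (
    'Sentence: "She deposited cash at the bank."\n'
    "Reasoning: 'deposited cash' signals the financial sense.\n"
    "Sense: a financial institution\n\n"
)

def build_wsd_prompt(sentence: str, target: str, pos: str) -> str:
    """Assemble the prompt: few-shot block, POS-filtered candidate
    senses, then a step-by-step instruction to pick exactly one."""
    senses = SENSE_KB.get((target, pos), [])
    sense_list = "\n".join(f"- {s}" for s in senses)
    return (
        FEW_SHOT
        + f'Sentence: "{sentence}"\n'
        + f"Target word: {target} (POS: {pos})\n"
        + f"Candidate senses:\n{sense_list}\n"
        + "Reasoning: think step by step about the context, then answer "
          "with exactly one candidate sense.\nSense:"
    )

print(build_wsd_prompt("We picnicked on the bank of the river.", "bank", "NOUN"))
```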

The literary analysis framing: Poetry is controlled ambiguity, a deliberate multiplicity of meaning crafted so that several readings coexist productively. A poem that resolves to a single meaning has failed as a poem. The 32% disambiguation rate means LLMs cannot even recognize the fundamental operation that makes poetry work. They cannot hold ambiguity open; they resolve it, and in resolving it, destroy it. This reframes the AMBIENT finding from a general limitation into a domain-killing one for literary work: the ability to manage ambiguity is not peripheral to literary analysis but central to it. As Can LLMs truly understand literary meaning or just mechanics? argues, the ambiguity failure is one of four converging mechanisms behind the mechanics-meaning gap.


Source: Linguistics, NLP, NLU; enriched from inbox/research-brief-llm-literary-analysis-2026-03-02.md

Original note title: llms fail at ambiguity recognition with gpt-4 achieving 32% correct disambiguations vs 90% for humans