Do standard NLP benchmarks hide LLM ambiguity failures?
When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?
Standard NLP benchmark curation assumes single gold-standard interpretations. When annotators disagree, the practice is to filter out the ambiguous examples — treating disagreement as annotation noise rather than evidence of genuine interpretive multiplicity.
The consequence is systematic: benchmarks cannot evaluate what they have excluded. LLM ambiguity failure — the inability to recognize that sentences have multiple valid interpretations and to disentangle them — is invisible in standard evaluation because the test items that would reveal it are removed before evaluation begins.
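The curation step described above is easy to make concrete. Below is a minimal sketch (hypothetical NLI-style labels E/N/C and a made-up agreement threshold, not any specific benchmark's pipeline) of how majority-vote filtering discards exactly the contested items:

```python
from collections import Counter

def curate(items, min_agreement=0.8):
    """Typical benchmark curation: keep an item only if annotators
    mostly agree, and take the majority label as the single gold answer.
    Items below the threshold -- often the genuinely ambiguous ones --
    are silently dropped before any model is evaluated."""
    kept, dropped = [], []
    for item in items:
        labels = item["annotations"]
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count / len(labels) >= min_agreement:
            kept.append({**item, "gold": top_label})
        else:
            dropped.append(item)
    return kept, dropped

items = [
    # contested: different readings of the premise yield different labels
    {"text": "The chicken is ready to eat.", "annotations": ["E", "N", "E", "N", "C"]},
    # uncontested: everyone agrees
    {"text": "All men are mortal; Socrates is a man.", "annotations": ["E", "E", "E", "E", "E"]},
]
kept, dropped = curate(items)
# the ambiguous item is gone before evaluation begins
```

Anything a model gets wrong on items like the first one can never show up in its score, because those items never reach the test set.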
This is not a minor calibration issue. Ambiguity management is central to human language understanding. The ability to anticipate misunderstanding, ask clarifying questions, revise interpretations, and use context to select among readings is what distinguishes robust language comprehension from pattern matching. A benchmark that excludes all ambiguous instances evaluates only the easy cases.
The methodological insight from AMBIENT (Liu et al. 2023): by deliberately targeting and including ambiguous examples — spanning diverse ambiguity types, with multiple valid interpretations annotated per example — the evaluation reveals a gap that standard benchmarks are blind to: roughly 32% accuracy for GPT-4 versus 90% for humans.
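The scoring idea behind this kind of evaluation can be caricatured in a few lines (a hypothetical simplification, not AMBIENT's actual metric): each item carries several valid interpretations, and the model earns credit only if it recovers all of them rather than committing to one.

```python
def recognizes_ambiguity(item, model_readings):
    """Credit the model only when every valid interpretation of the
    item appears among the readings it produced."""
    return set(item["interpretations"]) <= set(model_readings)

item = {
    "text": "I saw her duck.",
    "interpretations": ["her pet duck", "her act of ducking"],
}
assert recognizes_ambiguity(item, ["her pet duck", "her act of ducking"])
assert not recognizes_ambiguity(item, ["her pet duck"])  # one reading is not enough
```

Under this scoring, a fluent model that always collapses to its single most probable reading fails every ambiguous item, which is invisible under single-gold-label accuracy.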
This connects to "Can models pass tests while missing the actual grammar?": both identify evaluation designs that let LLMs succeed without demonstrating the underlying competence being measured. The surface pattern passes; the structural capability is absent.
The NLI domain provides direct evidence. "Lost in Inference" (Bittermann et al.) analyzes annotation disagreement patterns across NLI benchmarks and finds that performance is not saturated: the best models still fail to match human performance on contested cases, and human annotators continue to disagree in structured ways. The disagreement isn't noise — it reflects genuine interpretive multiplicity. Since standard benchmarks adjudicate this disagreement away before evaluation, models never have to confront the hard cases. The practical implication: progress on standard NLP benchmarks may systematically overestimate language understanding for the specific capability that most distinguishes human communication from pattern completion.
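One concrete alternative to adjudicating disagreement away is to score a model against the full human label distribution on each contested item, rather than against a single gold label. A sketch using Jensen-Shannon divergence (hypothetical labels and counts; not the exact protocol of either paper cited above):

```python
import math
from collections import Counter

def distribution(labels, classes=("E", "N", "C")):
    """Turn a list of annotator (or model) labels into a probability vector."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [counts.get(c, 0) / total for c in classes]

def js_divergence(p, q):
    """Jensen-Shannon divergence: 0 when the model's label distribution
    matches the human one, log(2) at maximal mismatch."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

human = distribution(["E", "N", "E", "N", "C"])  # structured disagreement
model = distribution(["E"] * 5)                  # model commits to one reading
score = js_divergence(human, model)              # ~0.274: the mismatch is visible
```

A model that flattens a contested item to one reading scores poorly here, while it would look perfect against a majority-vote gold label.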
Source: Linguistics, NLP, NLU
Related concepts in this collection
- **Can language models recognize when text is deliberately ambiguous?** Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase, a core human language skill that appears largely absent in current models despite their fluency on standard tasks. *(the finding the benchmarks were hiding)*
- **Can models pass tests while missing the actual grammar?** Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures. *(same evaluation design failure: passing tests without acquiring the underlying structure)*
- **Why do speakers deliberately use ambiguous language?** Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems. *(what the benchmarks treat as noise is a feature)*
Original note title: nlp benchmarks systematically exclude ambiguous instances hiding llms most fundamental language limitation