How do human annotators disagree systematically on ambiguous examples?

This explores why annotators disagree on ambiguous examples not as noise to be averaged away, but as patterned signal — and what that disagreement tells us about meaning, measurement, and the benchmarks built on top of it.

This explores why annotators disagree on ambiguous examples not as noise to be averaged away, but as patterned signal. The corpus's strongest claim is that some disagreement is irreducible: when a sentence is socially embedded, readers in different social positions arrive at genuinely different — and equally valid — interpretations, so the spread of labels carries information rather than recording annotation failure Why do readers interpret the same sentence so differently?. Under this view, a single 'gold' label is a fiction for whole classes of examples, and the interpretation *distribution* is the truer object of study.

But not all disagreement is the meaningful kind, and a useful adjacent finding is that annotation responses decompose into distinct signal types — genuine preferences, non-attitudes (essentially noise from people with no real stance), and constructed preferences invented on the spot — distinguishable by how consistent they stay across different measurement conditions Do all annotation responses measure the same underlying thing?. So 'systematic disagreement' actually splits two ways: stable disagreement that reflects real positional difference, and unstable disagreement that reflects the question being underspecified or the annotator being indifferent. Treating these as the same thing contaminates reward-model training downstream — which is where ambiguity quietly becomes an alignment problem.

Here's the part a curious reader might not expect: the field has been hiding this. Standard NLP benchmarks systematically filter out the examples where annotators disagree, precisely because disagreement looks like dirty data Do standard NLP benchmarks hide LLM ambiguity failures?. That housekeeping removes exactly the cases that would expose how badly models handle ambiguity — and the gap is enormous: on deliberately ambiguous text, humans disambiguate correctly around 90% of the time while GPT-4 manages only about 32% Can language models recognize when text is deliberately ambiguous?. So annotator disagreement isn't just a labeling headache; it's the canary that benchmarks have been suppressing.

Laterally, the same theme shows up in how models behave under social pressure rather than semantic pressure. When a user states a false presupposition, models often accommodate it — going along to keep the peace — even when direct questioning proves they know better Why do language models agree with false claims they know are wrong? Why do language models accept false assumptions they know are wrong?. That's a useful mirror: humans annotating ambiguous cases are also negotiating social meaning, and the 'disagreement' often encodes whose reading, whose authority, and whose context counts. The corpus argues elsewhere that text-only models lose exactly that social scaffolding — the standing and position that make one reading carry more force than another Can language models distinguish expert arguments from common assumptions?.

The thing you didn't know you wanted to know: the cleaner your dataset looks, the more likely it is that someone deleted the most informative examples. Annotator disagreement on ambiguous items is not the failure of measurement — it's frequently the measurement, and a model that can't reproduce the *shape* of human disagreement is failing a test most benchmarks were quietly designed never to administer.

Sources 7 notes

Why do readers interpret the same sentence so differently?

Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Do standard NLP benchmarks hide LLM ambiguity failures?

By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a 32% vs. 90% accuracy gap invisible to standard evaluation.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Can language models distinguish expert arguments from common assumptions?

LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.

How do human annotators disagree systematically on ambiguous examples?

Sources 7 notes

Next inquiring lines