Why do readers interpret the same sentence so differently?
How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This note explores whether social position and moral framing systematically generate competing but equally valid readings.
The standard assumption underlying NLP benchmark design is that sentences have one correct interpretation. Under this assumption, disagreement between annotators signals annotation failure, and the remedy is to filter or adjudicate until a single answer emerges.
Interpretation Modeling (IM; Cercas Curry et al. 2023) challenges this assumption directly. The study models multiple interpretations of socially embedded sentences, guided by readers' attitudes toward the author and their understanding of the sentence's implicit moral judgments. The finding: conflicting interpretations are socially plausible. They reflect different social positions and moral framings, not annotation error.
This is not about ambiguous sentences in the traditional sense (lexical or syntactic ambiguity) but about the social and implicit dimensions of meaning in natural communication. A sentence embedded in a social context carries different meanings for readers with different:
- Relationships to the speaker
- Moral frameworks for evaluating the content
- Common ground with the speaker's implied community
The resulting interpretations are not all "correct" in a truth-conditional sense, but they are all "valid" in a socially and pragmatically grounded sense: readers with different social positions genuinely understand different things from the same text.
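To make this concrete, here is a minimal sketch, my illustration rather than the IM paper's actual model: interpretation behaves like a function of both the sentence and the reader's position, so there is no single gold output. The profile fields mirror the three dimensions listed above; the example sentence, field values, and readings are all hypothetical.

```python
# A toy model of interpretation as a function of (text, reader position).
# Illustrative only: not the IM paper's method; all values are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class ReaderProfile:
    relationship_to_speaker: str   # e.g. "close friend", "stranger"
    moral_framework: str           # e.g. "care-oriented", "authority-oriented"
    shares_common_ground: bool     # inside the speaker's implied community?

def interpret(sentence: str, reader: ReaderProfile) -> str:
    """The same sentence maps to different readings depending on the reader;
    none of the outputs is privileged as the single correct one."""
    if reader.shares_common_ground and reader.relationship_to_speaker == "close friend":
        return "affectionate teasing"
    if reader.moral_framework == "authority-oriented":
        return "disrespectful criticism"
    return "neutral observation"

sentence = "Of course she showed up late again."
readers = [
    ReaderProfile("close friend", "care-oriented", True),
    ReaderProfile("stranger", "authority-oriented", False),
    ReaderProfile("stranger", "care-oriented", False),
]
for r in readers:
    print(r.relationship_to_speaker, "/", r.moral_framework,
          "->", interpret(sentence, r))
```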
The implication is uncomfortable for NLP: the gold standard that benchmarks aspire to may not exist for a substantial portion of natural language. Treating disagreement as noise produces evaluation systems that measure agreement on easy cases while missing the hard question of how interpretation actually works.
The NLI disagreement literature provides statistical confirmation. "Lost in Inference" (an analysis of NLI annotation disagreement across major benchmarks) finds that NLI performance is not saturated: humans continue to disagree, and the disagreement is structured rather than random noise. Human annotation distributions on contested examples carry information that the majority label discards. This is the empirical grounding for IM's theoretical claim: interpretation is irreducibly multiple, and the distribution over interpretations is itself meaningful data.
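A minimal sketch of what "the distribution is itself meaningful data" means in practice, with invented labels and counts (nothing below is from the "Lost in Inference" analysis): two items can share a majority label while having very different human distributions, and only the distribution records the disagreement.

```python
# Why the annotation distribution carries information the majority label
# discards. Annotator counts are hypothetical.
from collections import Counter
import math

def label_distribution(annotations: list[str]) -> dict[str, float]:
    """Raw annotator labels -> probability distribution over labels."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def entropy_bits(dist: dict[str, float]) -> float:
    """Shannon entropy: 0 for unanimous items, higher when readers disagree."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

easy = ["entailment"] * 10
contested = ["entailment"] * 5 + ["neutral"] * 3 + ["contradiction"] * 2

for name, anns in [("easy", easy), ("contested", contested)]:
    dist = label_distribution(anns)
    majority = max(dist, key=dist.get)
    print(f"{name}: majority={majority}, distribution={dist}, "
          f"entropy={entropy_bits(dist):.2f} bits")

# Both items reduce to the same gold label ("entailment") under majority
# voting; evaluating a model against the full distribution instead (e.g.
# cross-entropy against the human soft labels) preserves the structured
# disagreement that the majority label throws away.
```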
An additional mechanism: social identity projection. Readers don't just apply their moral frameworks in the abstract; they project the likely social identity of the author from textual cues, then interpret the content through the lens of that projected identity. Two readers who project different author identities from the same text will read the same words as carrying different social stances. This is a claim about the social grounding of interpretation that goes beyond semantic ambiguity.
This connects to "Why do speakers deliberately use ambiguous language?": interpretive multiplicity is not a failure of specification but a feature of how socially embedded language operates. And because benchmark design filters out exactly these contested cases (see "Do standard NLP benchmarks hide LLM ambiguity failures?"), the irreducibility is doubly hidden: first excluded at dataset construction, then collapsed by majority-label adjudication.
Source: Linguistics, NLP, NLU
Related concepts in this collection
- Why do speakers deliberately use ambiguous language?
  Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.
  Relation: interpretive multiplicity is functionally analogous to ambiguity, not a defect but a feature.
- Do standard NLP benchmarks hide LLM ambiguity failures?
  When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?
  Relation: this multiplicity is what benchmark design excludes.
- What three layers must discourse systems actually track?
  Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.
  Relation: intentional structure is where social framing operates.
- Why do LLM persona prompts produce inconsistent outputs across runs?
  Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
  Relation: the attempt to use LLMs to simulate multiple human perspectives fails because LLMs lack the stable social situatedness that makes interpretation group-specific.
Original note title: sentence interpretations are irreducibly multiple because social position and moral framing generate competing readings