How does the inability to manage ambiguity undermine literary analysis tasks?
This explores why LLMs struggle with literary analysis specifically because they can't hold several valid readings of the same text at once — and the corpus suggests the problem is structural, not a matter of more training.
The corpus points to a clean diagnosis: machines can describe how literature works but can't sit inside its uncertainty. One study found that LLMs comfortably extract the *mechanics* of literary language, such as metaphoric mappings and stylistic signatures, yet collapse on the dimensions where meaning actually lives: implicit relations (24% accuracy), evaluative stance, connotation, and above all ambiguity, where GPT-4 recognizes deliberately multiple readings only 32% of the time versus 90% for humans (see "Can LLMs truly understand literary meaning or just mechanics?" and "Can language models recognize when text is deliberately ambiguous?"). Literary analysis isn't decoding a fixed message; it's tolerating a passage that means two things on purpose. A reader who flattens that to a single interpretation hasn't analyzed the poem; they've replaced it.
What's striking is that the corpus reframes ambiguity not as noise to be cleaned up but as a *design feature* of language. Speakers deliberately exploit it for efficiency, polite indirection, and plausible deniability, so a system trained to resolve every sentence to one crisp answer fundamentally misreads what language is for (see "Why do speakers deliberately use ambiguous language?"). The same point arrives from the reader's side: interpretations of a socially loaded sentence are irreducibly multiple across different social positions, and that disagreement is meaningful signal, not annotation error (see "Why do readers interpret the same sentence so differently?"). Literary meaning lives precisely in this spread, which is the one thing a single-output model is built to erase.
Here's the part you might not expect: this failure has been hidden in plain sight. Standard NLP benchmarks routinely *filter out* the examples where human annotators disagree, which are exactly the ambiguous cases, so models look fluent while their deepest weakness goes untested (see "Do standard NLP benchmarks hide LLM ambiguity failures?"). The capability gap that matters most for literature is the one the evaluation pipeline is engineered not to see.
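To make the filtering step concrete, here is a minimal Python sketch of how an agreement threshold can silently remove ambiguous items from a test set. The items, labels, the 0.8 threshold, and the function names are all hypothetical illustrations, not taken from any specific benchmark's pipeline.

```python
from collections import Counter

# Hypothetical annotations: each item carries labels from five annotators.
ITEMS = [
    {"id": "a", "labels": ["entail", "entail", "entail", "entail", "entail"]},
    {"id": "b", "labels": ["entail", "neutral", "entail", "neutral", "contradict"]},
    {"id": "c", "labels": ["neutral", "neutral", "neutral", "neutral", "entail"]},
]


def agreement(labels):
    """Fraction of annotators who chose the most common label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


def filter_by_agreement(items, threshold=0.8):
    """Keep only items whose majority label reaches the (assumed) threshold."""
    return [item for item in items if agreement(item["labels"]) >= threshold]


def label_distribution(labels):
    """Alternative: keep the full spread of readings instead of one gold label."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


kept = filter_by_agreement(ITEMS)
dropped = [item["id"] for item in ITEMS if item not in kept]

print("kept:", [item["id"] for item in kept])
print("dropped:", dropped)
print("spread for b:", label_distribution(ITEMS[1]["labels"]))
```

In this toy data the only item that is dropped is the one where annotators split across three readings, which is the kind of case a literary reader would care about most. Keeping the label distribution instead of discarding the item is one way to preserve that disagreement as signal.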
Why can't more scale fix it? Adjacent work suggests the breakage is architectural rather than informational. The 'Potemkin understanding' pattern shows models that explain a concept correctly, fail to apply it, and even recognize their own failure, a sign that explanation and execution run on disconnected pathways (see "Can LLMs understand concepts they cannot apply?"). Reasoning breakdowns track instance *novelty* rather than complexity, meaning models fit familiar patterns instead of generalizing (see "Do language models fail at reasoning due to complexity or novelty?"). Both imply that holding two live interpretations in tension, the core move of close reading, is not a skill the current paradigm has merely failed to learn so far, but one it isn't shaped to perform.
The more hopeful thread is that ambiguity-handling may be a *process* problem you can scaffold rather than a fixed ceiling. A leader-follower debate protocol, where one agent proposes interpretations and rotating challengers attack them, pushed a small 7B model to 76.7% on ambiguity detection, achieving better verification through forced disagreement (see "Can structured debate roles help small models detect ambiguity?"). And reframing figurative language (metaphor, idiom, pun) as a single pragmatic task of recovering meaning from non-literal expression hints that what literary analysis needs is better semantic decoupling, not more category labels (see "Can one model handle all types of figurative language?"). The takeaway: literary analysis is hard for machines not because the prose is fancy, but because it demands staying in the unresolved, and the most promising fixes manufacture disagreement instead of resolving it away.
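The control flow of a leader-follower debate with rotating roles can be sketched roughly as follows. This is an illustration under stated assumptions, not the cited study's implementation: the `Agent` class, the `make_stub` helper, the stopping rule (stop when a full round changes nothing), and the example readings are all hypothetical, and the stub agents stand in for what would be model calls.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

Readings = FrozenSet[str]


@dataclass
class Agent:
    """Hypothetical agent: in practice each callable would wrap a model call."""
    name: str
    propose: Callable[[str], Readings]
    challenge: Callable[[str, Readings], Readings]


def make_stub(name, known=(), rejected=()):
    """Build a rule-based stand-in that adds readings it knows and strikes ones it rejects."""
    known_set, rejected_set = frozenset(known), frozenset(rejected)

    def propose(sentence):
        return known_set

    def challenge(sentence, readings):
        return (readings | known_set) - rejected_set

    return Agent(name, propose, challenge)


def debate(sentence: str, agents: List[Agent], max_rounds: int = 6) -> Readings:
    """Rotate the leader each round; followers revise the leader's proposal in turn."""
    readings: Readings = frozenset()
    for round_index in range(max_rounds):
        leader = agents[round_index % len(agents)]
        followers = [agent for agent in agents if agent is not leader]
        revised = readings | leader.propose(sentence)
        for follower in followers:
            revised = follower.challenge(sentence, revised)
        if revised == readings:
            break  # assumed consensus rule: a full round changed nothing
        readings = revised
    return readings


agents = [
    make_stub("a", known=["she owns a duck"]),
    make_stub("b", known=["she lowered her head"]),
    make_stub("c", rejected=["she is a duck"]),
]
verdict = debate("I saw her duck", agents)
print(sorted(verdict), "ambiguous:", len(verdict) > 1)
```

The design point is that no single agent decides the outcome: a sentence is flagged as ambiguous only when more than one reading survives challenge from every role, which is one plausible way to read "forced disagreement" as a verification step.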
Sources (9 notes)
LLMs successfully extract explicit literary features like metaphoric mappings and stylistic signatures. However, they systematically fail at implicit relations (24% accuracy), ambiguity recognition (32% vs 90% human), evaluative stance-taking, and preserving connotation—the core dimensions where literary meaning operates.
The AMBIENT benchmark shows that GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, suggesting that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
Research shows speakers exploit ambiguity to balance efficiency against clarity, enable polite indirection, and permit plausible deniability. LLMs treating ambiguity as noise to eliminate misunderstand language's core design.
Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.
By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a 32% vs. 90% accuracy gap invisible to standard evaluation.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. They fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length tends to succeed when the model was trained on similar instances.
Mistral-7B achieved 76.7% accuracy in ambiguity detection through a protocol where a leader proposes interpretations and two followers challenge them with rotating roles. Role rotation and consensus forcing prevent persuasive framing failures and create stronger verification than pairwise debate.
The Diplomat dataset (4,177 dialogues) reframes metaphors, idioms, and puns as one pragmatic task: recovering literal meaning from non-literal expression. This framing suggests LLMs need better semantic decoupling ability, not more category-specific training data.