How much semantic meaning survives when LLMs paraphrase poetry and literary text?
This explores what happens to the actual meaning of a poem or literary passage when an LLM rewrites it in different words — and the corpus suggests the surface mechanics survive while the deeper meaning leaks out.
The short answer the corpus converges on: the machinery survives, the meaning doesn't. Several notes draw the same dividing line in different places. LLMs are good at the dissectible, explicit layer of literature (metaphoric mappings, stylistic signatures, authorship fingerprints) but fail at the implicit, evaluative, ambiguous layer where literary meaning actually lives (see: Can LLMs truly understand literary meaning or just mechanics?). Style detection saturates early (GPT-2 hits 95% on authorship from pattern alone), yet the model has no framework to say *why* those choices carry weight; detection without interpretation is cataloguing, not criticism (see: Can language models truly understand literary style?).
The most concrete answer to 'how much survives' comes from the frequency work, which reveals a directional bias rather than random loss. LLMs systematically prefer high-frequency phrasings over rarer but equivalent ones, because they track statistical mass from pretraining rather than recognizing meaning (see: Do language models really understand meaning or just surface frequency?). This matters for poetry because of the second half of the mechanism: frequent words tend to be more abstract (general concepts occur more often than specific ones), so a frequency-biased paraphrase drifts steadily toward abstraction and erases expert-level, fine-grained specificity (see: Does word frequency correlate with semantic abstraction?). Poetry is precisely the genre that lives in the rare, specific, connotation-loaded word, so paraphrase pushes it toward the bland and general. 'Same meaning' prompts already produce different outputs for this reason; semantic equivalence is, in the corpus's blunt phrasing, a fiction (see: Why do semantically identical prompts produce different LLM outputs?).
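One way to make the frequency drift described above measurable is to compare the average log-frequency of the words in an original line against the words in its paraphrase. The sketch below is a minimal illustration, not a method taken from the cited notes: the frequency table, its values, the floor for unknown words, and the function names are all hypothetical, and the two example lines were invented for the demonstration. In practice the table would come from a real corpus frequency list.

```python
import math

# Hypothetical per-million word frequencies. These values are invented
# for illustration and do not come from any real corpus.
TOY_FREQ_PER_MILLION = {
    "the": 50000.0, "over": 1200.0,
    "bird": 90.0, "field": 80.0, "flew": 40.0,
    "hovered": 3.0, "stubble": 0.6, "kestrel": 0.4,
}

# Assumed floor for words missing from the table, so log10 stays defined.
DEFAULT_FREQ = 0.1


def mean_log_frequency(text, freq=TOY_FREQ_PER_MILLION):
    """Average log10 frequency of the words in `text`."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        raise ValueError("text contains no words")
    total = sum(math.log10(freq.get(w, DEFAULT_FREQ)) for w in words)
    return total / len(words)


def frequency_drift(original, paraphrase, freq=TOY_FREQ_PER_MILLION):
    """Positive result: the paraphrase uses more frequent words on average."""
    return mean_log_frequency(paraphrase, freq) - mean_log_frequency(original, freq)


# Invented example: rare, specific words replaced by common, general ones.
original = "The kestrel hovered over the stubble"
paraphrase = "The bird flew over the field"
drift = frequency_drift(original, paraphrase)
print(f"drift in mean log10 frequency: {drift:+.2f}")
```

With this toy table the drift is positive, which is the direction the frequency notes predict. The score only detects the shift toward common words. It says nothing about whether the lost specificity mattered to the poem, which is the interpretive step the notes argue the models lack.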
Two capacities poetry depends on are exactly where the models break. The first is ambiguity, holding several readings of a line at once, and it collapses: GPT-4 disambiguates only 32% of deliberately ambiguous cases versus 90% for humans, because it can't hold multiple interpretations simultaneously (see: Can language models recognize when text is deliberately ambiguous?). The second is figurative language, which degrades along a spectrum: conventional, lexicalized metaphors paraphrase fine, but novel literary metaphors, the kind a poet invents, require genuine conceptual domain-mapping that pattern recognition can't do (see: Where does LLM metaphor comprehension actually break down?). So the loss isn't uniform; it's heaviest exactly where the writing is most original.
There's a sharper framing worth pulling in from adjacent territory. One line of work reframes all figurative language (metaphor, idiom, pun) as a single pragmatic task: recovering literal meaning from non-literal expression (see: Can one model handle all types of figurative language?). Note what that framing concedes: the success metric is *flattening* the non-literal into the literal. For poetry, the non-literal often *is* the meaning, so even a 'successful' paraphrase under this framing has discarded the thing you cared about. This connects to the deeper diagnosis that LLM understanding can be a 'potemkin': a correct explanation running on a pathway disconnected from correct application (see: Can LLMs understand concepts they cannot apply?). A model can explain a poem's theme fluently and still produce a paraphrase that has quietly drained it.
The thing you might not have known you wanted to know: the loss is patterned and predictable, not noise. It flows in a specific direction, toward the frequent, the abstract, the literal, and the single reading. A paraphrase therefore loses meaning the way a photocopy of a photocopy loses contrast: specifics fade first, ambiguity gets resolved into one safe reading, and novel images get translated into conventional ones. If you want the grounding question underneath all this, namely whether a text-trained system can reach meaning anchored in lived human experience at all, the corpus's most generous answer is 'indirect causal grounding': regularities extracted secondhand from causally grounded humans, with gaps (see: Can large language models develop genuine world models without direct environmental contact?). For poetry, those gaps are the whole point.
Sources (10 notes)
LLMs successfully extract explicit literary features like metaphoric mappings and stylistic signatures. However, they systematically fail at implicit relations (24% accuracy), ambiguity recognition (32% vs 90% human), evaluative stance-taking, and preserving connotation—the core dimensions where literary meaning operates.
GPT-2 achieves 95% accuracy identifying authorship through style patterns alone, but lacks the evaluative framework to explain why those stylistic choices carry meaning. Detection without interpretation remains cataloguing, not criticism.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
WordNet analysis shows hypernyms (general concepts) occur more frequently than hyponyms (specific ones). Combined with LLMs' frequency bias, this means preferring common paraphrases systematically drifts toward abstraction, erasing expert-level specificity.
Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
LLMs handle conventional, lexicalized metaphors but fail on novel literary metaphors requiring conceptual domain mapping. This degradation reveals a fundamental gap between pattern recognition and genuine semantic mapping.
The Diplomat dataset (4,177 dialogues) reframes metaphors, idioms, and puns as one pragmatic task: recovering literal meaning from non-literal expression. This framing suggests LLMs need better semantic decoupling ability, not more category-specific training data.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
LLMs form structured world representations by extracting regularities from training data produced by causally grounded humans. This constitutes indirect causal grounding mediated through text, though the chain has gaps that limit real-time verification and model updating.