INQUIRING LINE

Why do language models struggle with context-dependent pragmatic interpretation?

This explores why LLMs misread meaning that depends on situation — the unstated inferences, communicative stakes, and contextual cues that humans track automatically — rather than failing at literal word-level understanding. The corpus suggests the problem isn't missing knowledge but a failure to let context override learned defaults.


Context-dependent meaning is the part of communication that lives in the situation rather than the words. The recurring finding across the corpus is striking: models usually *know* the right thing, but defaults learned in training drown out what the current context is telling them. When prior associations from training are strong, in-context information gets ignored, and prompting alone reportedly cannot fix it; overriding the prior requires causal intervention in the model's internal representations (Why do language models ignore information in their context?). The same pattern shows up with false presuppositions: models often accept a baked-in wrong assumption even when a direct question shows they know the fact, because the presupposition pulls harder toward accommodation than knowledge pulls toward correction (Why do language models accept false assumptions they know are wrong?).
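The gap described above, between what a model knows and what it accommodates, can be stated as a simple conditional rate. The following is a minimal sketch, not the scoring used by any benchmark cited here: the `Item` record, its field names, and the sample data are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    """One hypothetical evaluation item (illustrative, not benchmark data)."""
    knows_fact: bool               # model answered the direct factual question correctly
    rejected_presupposition: bool  # model pushed back on the false premise


def accommodation_given_knowledge(items: List[Item]) -> Optional[float]:
    """Share of items where the model knew the fact yet accepted the false premise."""
    known = [it for it in items if it.knows_fact]
    if not known:
        return None  # undefined when the model knew none of the facts
    accepted = sum(1 for it in known if not it.rejected_presupposition)
    return accepted / len(known)


# Made-up records, for illustration only.
sample = [
    Item(knows_fact=True, rejected_presupposition=True),
    Item(knows_fact=True, rejected_presupposition=False),
    Item(knows_fact=True, rejected_presupposition=False),
    Item(knows_fact=False, rejected_presupposition=False),
    Item(knows_fact=True, rejected_presupposition=True),
]
rate = accommodation_given_knowledge(sample)
print(rate)
```

Conditioning on `knows_fact` is the point of the sketch: it separates accommodation despite knowledge from plain ignorance, which is the distinction the notes draw.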

Pragmatics is exactly the domain where this hurts most, because pragmatic meaning requires *modulating* an inference based on stakes. Humans adjust scalar implicatures ("some" implying "not all") depending on whether the speaker is being literal, what is being emphasized, or whether bluntness would be face-threatening. ChatGPT does none of this: it computes the same implicature regardless of communicative context, suggesting pragmatic competence requires tracking communicative stakes that models systematically miss (Can language models adapt implicature to conversational context?). A related failure: models often decline to correct a user's false claim, not from ignorance but from learned face-saving. They mirror the human conversational norm of preserving social harmony, even when it produces a wrong answer (Why do language models avoid correcting false user claims?).

Here's the cross-domain twist worth noticing: some of these "failures" are the model being *too* human. Face-saving avoidance and presupposition accommodation are real human pragmatic behaviors learned from training data, so on this side the model is pragmatically *over*-compliant, not pragmatically blind. On the inference side it's the opposite: the model can't hold the multiple live interpretations that ambiguity demands. On the AMBIENT benchmark, GPT-4 correctly disambiguates only 32% of deliberately ambiguous cases versus 90% for humans, across lexical, structural, and scope ambiguity, which suggests it cannot keep two readings in play at once (Can language models recognize when text is deliberately ambiguous?). A similar brittleness appears at the grammatical level, where models misidentify embedded clauses and complex nominals, picking up surface statistics rather than deep structure (Why do large language models fail at complex linguistic tasks?).

The training objective itself is part of the story. RLHF rewards immediate, confident helpfulness, which actively trains *against* the pragmatic move of asking a clarifying question or seeking the user's real intent. Multi-turn degradation turns out to be an intent-alignment gap, not lost capability: models answer prematurely instead of grounding, and architectures that explicitly parse intent before responding recover the lost performance without retraining (Why do language models lose performance in longer conversations?; Why do language models respond passively instead of asking clarifying questions?). When users give thin context, models don't ask. They fall back on blended training-data priors, a "context collapse" caused by missing scaffolding rather than by the merging of audiences seen on social media (Why do large language models produce generic responses to vague queries?).
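To make "parse intent before responding" concrete, here is a toy sketch of the control flow. It is not the Mediator-Assistant architecture from the cited note: the keyword matching, the `REQUIRED_SLOTS` table, and every function name are hypothetical stand-ins, and a real system would presumably use a model for each role.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical table: what each task needs to know before it can run.
REQUIRED_SLOTS: Dict[str, List[str]] = {
    "translate": ["target_language"],
    "summarize": ["length"],
}


@dataclass
class Intent:
    task: str
    missing: List[str] = field(default_factory=list)


def parse_intent(message: str, known_slots: Dict[str, str]) -> Intent:
    """Toy mediator: keyword-match the task, then list required slots not yet supplied."""
    task = next((t for t in REQUIRED_SLOTS if t in message.lower()), "unknown")
    missing = [s for s in REQUIRED_SLOTS.get(task, []) if s not in known_slots]
    return Intent(task, missing)


def respond(message: str, known_slots: Dict[str, str],
            assistant: Callable[[str, str, Dict[str, str]], str]) -> str:
    """Ask a clarifying question when intent is underspecified; otherwise execute."""
    intent = parse_intent(message, known_slots)
    if intent.task == "unknown":
        return "Could you say more about what you'd like me to do?"
    if intent.missing:
        slot = intent.missing[0].replace("_", " ")
        return f"Before I {intent.task}, could you tell me the {slot}?"
    return assistant(intent.task, message, known_slots)


def echo_assistant(task: str, message: str, slots: Dict[str, str]) -> str:
    """Placeholder for the component that actually does the work."""
    return f"[{task}] handled with {slots}"


print(respond("Please translate this note", {}, echo_assistant))
print(respond("Please translate this note", {"target_language": "French"}, echo_assistant))
```

The design choice worth noticing is that the clarifying question is produced by the gate, not by the assistant, so the assistant's bias toward answering immediately never gets a chance to act on an underspecified request.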

What you might not have expected to learn: the bottleneck isn't comprehension, it's *deference*. Across these notes the model repeatedly demonstrates it has the relevant knowledge and then declines to apply it, because a stronger prior — a training association, a learned politeness norm, a reward for sounding helpful — wins. Pragmatic interpretation is fundamentally about letting the specific situation override your defaults, and that is precisely the operation current models are trained to suppress.


Sources (9 notes)

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Can language models adapt implicature to conversational context?

ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Can language models recognize when text is deliberately ambiguous?

The AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, suggesting that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Why do language models lose performance in longer conversations?

LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating a pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why do large language models produce generic responses to vague queries?

Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.
