Can language models adapt implicature to conversational context?
Do large language models flexibly modulate scalar implicatures in response to information structure, face-threatening situations, and explicit instructions, as humans do? This tests whether their pragmatic computation is genuinely context-sensitive or merely defaults to literal interpretation.
Scalar implicatures are a core pragmatic phenomenon: when someone says "some," it typically implies "not all." This is not semantically entailed but pragmatically inferred based on the maxim of quantity — if all were true, the speaker would have said "all." Human computation of these implicatures is sensitive to communicative context in documented ways.
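To make the two readings concrete, here is a minimal sketch in Python (the predicate names and counts are invented for illustration): the literal semantics of "some" is compatible with "all," while the pragmatically strengthened reading excludes it.

```python
def literal_some(n_true: int, n_total: int) -> bool:
    """Semantic reading: "some X are Y" is true whenever at least one is."""
    return n_true >= 1

def pragmatic_some(n_true: int, n_total: int) -> bool:
    """Strengthened reading: "some but not all", via the maxim of quantity."""
    return 1 <= n_true < n_total

# "Some of the ten students passed" when in fact all ten passed:
print(literal_some(10, 10))    # True  (semantically unobjectionable)
print(pragmatic_some(10, 10))  # False (the "not all" implicature is violated)
```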
Three experiments from "Pragmatic Implicature Processing in ChatGPT" (Ruytenbeek et al. 2024) tested whether ChatGPT shows human-like context-sensitivity in implicature. In all three, ChatGPT failed to show the human pattern (a prompt-level sketch of the three contrasts follows the list):
- Generalized conversational implicatures: Humans can inhibit implicature computation when explicitly instructed to interpret utterances literally. ChatGPT failed to show this distinction; it does not switch between pragmatic and literal processing modes.
- Information structure sensitivity: Humans compute more "some but not all" inferences when the scalar term is in the information focus (the direct answer to an explicit question) than when it is in the background. ChatGPT showed no sensitivity to information structure.
- Face context: Human scalar implicature rates differ between face-threatening and face-boosting contexts. If a poem is being evaluated and someone says "some people loved it," the implicature ("not all loved it") is more prominent in face-boosting contexts. ChatGPT showed no differential response to face context.
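As a rough illustration of how these three contrasts can be probed at the prompt level, here is a hedged sketch. The stimuli below are invented for this note, not the materials from Ruytenbeek et al., and the OpenAI Python client with the model name "gpt-4o" is just one possible backend; any chat API would do. A human-like pattern would show different "yes" rates across each pair; the paper reports that ChatGPT's rates do not differ.

```python
# Illustrative probes for the three context manipulations. All stimuli are
# invented for this sketch; they are NOT the materials from Ruytenbeek et al.
from openai import OpenAI  # assumes the official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROBES = {
    "explicit_instruction": (
        # Literal-mode instruction vs. default pragmatic mode.
        "Interpret the next sentence strictly literally. 'Some of the cookies "
        "were eaten.' Could all of them have been eaten? Answer yes or no.",
        "'Some of the cookies were eaten.' Could all of them have been eaten? "
        "Answer yes or no.",
    ),
    "information_structure": (
        # Scalar term in focus: it directly answers the question asked.
        "Q: How many students passed? A: 'Some students passed.' Does the "
        "answer imply that not all passed? Answer yes or no.",
        # Scalar term in the background: the question is about something else.
        "Q: What happened after the exam? A: 'Some students passed.' Does the "
        "answer imply that not all passed? Answer yes or no.",
    ),
    "face_context": (
        # Face-boosting: the remark praises the hearer's poem.
        "After a reading, you congratulate the poet: 'Some people loved it.' "
        "Does this imply that not everyone loved it? Answer yes or no.",
        # Face-threatening: the remark risks hurting the poet's feelings.
        "The poet worries the reading flopped; you say: 'Some people loved "
        "it.' Does this imply that not everyone loved it? Answer yes or no.",
    ),
}

def ask(prompt: str) -> str:
    # One completion per probe; a real study would sample repeatedly and
    # compare the rate of "yes" (implicature computed) across each pair.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

for manipulation, (condition_a, condition_b) in PROBES.items():
    print(manipulation, "->", ask(condition_a), "vs", ask(condition_b))
```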
These are not exotic phenomena. They are the basic flexibility that allows human conversation to be more than literal string exchange. Pragmatic competence requires tracking the communicative context — who is asking, why, what stakes are involved — and modulating interpretation accordingly. ChatGPT's failure is not isolated to edge cases; it extends to routine context-modulation effects that appear in any human conversation.
A complementary finding in non-literal language: GPT-4o significantly overestimates the irony likelihood of emojis relative to human perception (median irony scores higher than human medians; W = 918.5, p < .001). When prompted to rate the likelihood of specific emojis being used ironically, GPT-4o considers the same emojis more likely to express irony than humans do, possibly because ironic emoji usage is disproportionately represented in its training data. Adding demographic information to prompts does not substantially change GPT-4o's irony classification. This parallels the implicature failure: the model cannot calibrate to actual human pragmatic norms for non-literal communication, whether the signal is a scalar implicature or visual irony.
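For concreteness, the comparison behind that statistic can be reproduced in outline with scipy. This is a minimal sketch assuming the reported W is a Wilcoxon signed-rank statistic over paired per-emoji ratings; the numbers below are fabricated placeholders, not the study's data.

```python
# Sketch of the paired comparison: per-emoji irony-likelihood ratings from
# GPT-4o vs. human raters, tested with a Wilcoxon signed-rank test.
# The ratings below are fabricated placeholders, not the study's data.
from scipy.stats import wilcoxon

gpt4o_ratings = [0.80, 0.70, 0.90, 0.60, 0.75, 0.85, 0.65, 0.95]
human_ratings = [0.40, 0.50, 0.55, 0.30, 0.45, 0.50, 0.35, 0.60]

stat, p = wilcoxon(gpt4o_ratings, human_ratings, alternative="greater")
print(f"W = {stat}, p = {p:.4f}")  # small p: model ratings exceed human ratings
```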
Source: Linguistics, NLP, NLU
Related concepts in this collection
- Why does ChatGPT fail at implicit discourse relations? ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships? (Relation: scalar implicatures are implicit inferences; extends this insight.)
- Why do language models fail at communicative optimization? LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge? (Relation: implicature computation is a communicative-optimization principle not captured by distribution.)
- Why do speakers need to actively calibrate shared reference? Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding. (Relation: context-sensitivity in implicature is part of the calibration that LLMs skip.)
Original note title: llm scalar implicature computation fails to adapt to communicative context