Why does ChatGPT fail at implicit discourse relations?
ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
The discourse relations paper (ChatGPT on temporal, causal, and discourse relations) found a dramatic asymmetry in ChatGPT's discourse understanding:
- Explicit discourse relations (with connectives like "so," "because," "however"): ChatGPT performs well, can recognize most relation types, and in-context learning with label dependence structure helps further
- Implicit discourse relations (no connectives): 24.54% accuracy, 16.20% F1 on 11 second-level relation classes
This is not a small gap. On an 11-class task, 24.54% accuracy means roughly three out of four implicit relations are misclassified, and in the paper's assessment ChatGPT "cannot understand the abstract sense of each discourse relation and the features from the text" when the surface connectives are absent.
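To make the contrast concrete, here is a constructed pair (an illustration, not an example from the paper's test set): the same causal relation, once signaled by a connective and once left implicit, labeled with a standard PDTB-style second-level sense.

```python
# Constructed illustration (not drawn from the paper's data): the same causal
# relation, once marked by a connective and once left implicit between arguments.
examples = [
    {
        "arg1": "The factory closed last month",
        "connective": "because",           # explicit surface cue
        "arg2": "orders had fallen for three straight quarters",
        "sense": "Contingency.Cause",      # PDTB-style second-level label
        "type": "explicit",
    },
    {
        "arg1": "The factory closed last month.",
        "connective": None,                # no cue: the relation must be inferred
        "arg2": "Orders had fallen for three straight quarters.",
        "sense": "Contingency.Cause",
        "type": "implicit",
    },
]
```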
The explanation is transparent: LLMs have access to massive training data where connectives are pervasive and reliable signals. When you see "therefore" or "because," the discourse relation is explicit in the surface form. Learning to respond to these signals is straightforward statistical learning. Inferring the same relations without surface signals requires understanding what the two clauses actually mean and what logical relationship holds between them.
This asymmetry shows that what LLMs have learned for discourse relation detection is largely cue-based — they respond to surface signals, not to structural meaning. When the surface cue is removed, the competence collapses.
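A minimal sketch of what purely cue-based competence looks like, assuming a hypothetical connective-to-sense lookup table (the table entries and the fallback class below are illustrative, not the paper's model): such a classifier does fine whenever a connective is present and degenerates to a frequent-class guess the moment the cue is removed, because the arguments' meanings never enter the decision.

```python
# Hypothetical cue-based baseline: map surface connectives directly to PDTB-style
# senses. An illustration of surface-cue dependence, not the evaluated system.
CONNECTIVE_TO_SENSE = {
    "because": "Contingency.Cause",
    "so": "Contingency.Cause",
    "therefore": "Contingency.Cause",
    "but": "Comparison.Contrast",
    "however": "Comparison.Contrast",
    "then": "Temporal.Asynchronous",
    "also": "Expansion.Conjunction",
    "for example": "Expansion.Instantiation",
}

def classify(arg1: str, connective: str | None, arg2: str) -> str:
    """Label the relation between arg1 and arg2 using only the surface connective."""
    if connective and connective.lower() in CONNECTIVE_TO_SENSE:
        return CONNECTIVE_TO_SENSE[connective.lower()]  # explicit case: cue lookup
    # Implicit case: with no cue to read off, the baseline can only fall back on a
    # frequent-class guess; arg1 and arg2 are never actually interpreted.
    return "Expansion.Conjunction"

print(classify("The factory closed", "because", "orders had fallen"))  # Contingency.Cause
print(classify("The factory closed.", None, "Orders had fallen."))     # fallback guess
```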
This connects directly to "What three layers must discourse systems actually track?": implicit discourse relation detection requires exactly the intentional structure that the linguistic structure alone doesn't carry.
A concrete instance beyond discourse relations: The same explicit/implicit asymmetry surfaces in metaphor extraction. LLMs can identify explicit source-target domain mappings (where the analogy's terms are stated) but fail on the implicit elements human readers routinely infer — e.g., the unstated target concept that completes a proportional analogy where only three of four terms are given. The failure is not specific to discourse-connective tasks; it is the general pattern wherever meaning depends on what is not said.
The literary analysis implication: Poetry and literary prose operate primarily through implicit relations. The connections between images in a poem, the causal logic of a narrative, the thematic resonance between scenes: these are rarely marked by explicit connectives. A poet does not write "the rose symbolizes mortality because..." The reader must infer the relation. This means the 24% implicit accuracy rate is not a peripheral limitation for literary analysis; it is a central one. As "Can LLMs truly understand literary meaning or just mechanics?" argues, the discourse competence asymmetry is one of four converging mechanisms that explain why LLMs can parse literary texts mechanically but cannot interpret them meaningfully.
Source: Discourses; enriched from inbox/research-brief-llm-literary-analysis-2026-03-02.md
Related concepts in this collection
- Why do LLMs handle causal reasoning better than temporal reasoning? Exploring whether language models perform asymmetrically on different discourse relations and what training data patterns might explain the gap between causal and temporal reasoning abilities. (Connection: asymmetric competence from training data distribution; parallel finding.)
- What three layers must discourse systems actually track? Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail. (Connection: what implicit relations require that surface cues don't provide.)
- Can models pass tests while missing the actual grammar? Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures. (Connection: structural parallel; correct behavior on easy cases from surface heuristics.)
- Can language models adapt implicature to conversational context? Do large language models flexibly modulate scalar implicatures based on information structure, face-threatening situations, and explicit instructions, as humans do? This tests whether pragmatic computation is truly context-sensitive or merely literal. (Connection: the pragmatic parallel; just as implicit discourse requires inferring unstated relations, scalar implicature requires context-sensitive pragmatic modulation, and both fail for the same reason: surface cue dependence.)
Original note title: "llm discourse competence is asymmetric: explicit connectives enable performance but implicit relations cause systematic failure"