INQUIRING LINE

Why do explicit discourse connectives work when implicit relations fail?

This explores why LLMs handle discourse relations well when a linking word like 'because' or 'but' is on the page, yet fall apart when the same relationship is left unstated — and what that gap reveals about how these models actually process meaning.


The short version from the corpus: explicit connectives are *surface signals the model can pattern-match*, while implicit relations require reasoning about meaning, and that reasoning is what these models mostly don't do. ChatGPT performs well on explicit discourse relations but drops to 24.54% accuracy on implicit ones, a strong sign that its competence rides on the connective itself rather than on the semantic content underneath (see: Why does ChatGPT fail at implicit discourse relations?).
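
To make "riding on the connective" concrete, here is a minimal sketch of a classifier that looks only at the linking word and nothing else. Everything in it is an assumption for illustration: the connective table, the `cue_only_classifier` and `accuracy_by_condition` names, and the six toy sentence pairs are hypothetical and are not drawn from the benchmark behind the 24.54% figure. It shows how a pure surface strategy can look competent on explicit items and fall to a default guess once the connective is removed.

```python
from collections import defaultdict

# Hypothetical connective -> relation table (illustrative only).
CONNECTIVE_TO_RELATION = {
    "because": "cause",
    "therefore": "cause",
    "but": "contrast",
    "however": "contrast",
    "then": "temporal",
}


def cue_only_classifier(arg1: str, arg2: str, default: str = "cause") -> str:
    """Label a pair from the first word of arg2 alone; arg1 is deliberately ignored."""
    words = arg2.lower().split()
    first = words[0].strip(",.") if words else ""
    # With no known connective, fall back to a fixed default guess.
    return CONNECTIVE_TO_RELATION.get(first, default)


def accuracy_by_condition(examples):
    """Return accuracy separately for each value of the 'condition' field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        pred = cue_only_classifier(ex["arg1"], ex["arg2"])
        totals[ex["condition"]] += 1
        hits[ex["condition"]] += int(pred == ex["relation"])
    return {cond: hits[cond] / totals[cond] for cond in totals}


# Toy pairs: each implicit item is its explicit twin with the connective removed.
EXAMPLES = [
    {"arg1": "The road was icy.", "arg2": "Therefore the bus was late.",
     "relation": "cause", "condition": "explicit"},
    {"arg1": "She studied all week.", "arg2": "But she failed the exam.",
     "relation": "contrast", "condition": "explicit"},
    {"arg1": "He finished dinner.", "arg2": "Then he washed the dishes.",
     "relation": "temporal", "condition": "explicit"},
    {"arg1": "The road was icy.", "arg2": "The bus was late.",
     "relation": "cause", "condition": "implicit"},
    {"arg1": "She studied all week.", "arg2": "She failed the exam.",
     "relation": "contrast", "condition": "implicit"},
    {"arg1": "He finished dinner.", "arg2": "He washed the dishes.",
     "relation": "temporal", "condition": "implicit"},
]

scores = accuracy_by_condition(EXAMPLES)
print(scores)  # explicit: 3 of 3 correct; implicit: 1 of 3 (only the default guess lands)
```

The sketch is not a model of how an LLM works internally. It only shows that an explicit/implicit gap of this shape is what a cue-matching strategy would produce, which is why the gap reads as evidence for one.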

The same asymmetry shows up wherever a cue is explicit rather than inferred. Causal reasoning beats temporal reasoning in LLMs for this reason: causal connectives ("because," "therefore") are frequent and explicit in training text, while temporal order is usually left implicit and has to be reconstructed from context (see: Why do LLMs handle causal reasoning better than temporal reasoning?). Zoom out and it's a general pattern: models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and anything requiring forward planning across a discourse (see: Where exactly do language models fail at structural language tasks?). The connective isn't just a hint; it's load-bearing scaffolding the model leans on instead of building its own.

What makes this more than a quirk is that the problem isn't ignorance; it's a failure to compute structure that is present in the input. Models treat presupposition triggers and non-factive verbs as surface cues rather than working out their opposite effects on entailment, so embedding contexts become systematic blind spots (see: Why do embedding contexts confuse LLM entailment predictions?). They accommodate a false presupposition even when a direct question shows they know the correct fact (see: Why do language models accept false assumptions they know are wrong?), and they fail to adjust scalar implicatures to conversational context the way humans do (see: Can language models adapt implicature to conversational context?). In each case the knowledge is there; what's missing is the structural inference step that an explicit marker would otherwise spare them from taking.

This connects to a deeper claim worth pulling forward: chain-of-thought reasoning shows the same signature. CoT works by constraining the model to reproduce familiar reasoning *forms* from training rather than performing novel inference, and it degrades under distribution shift, the fingerprint of imitation rather than capability (see: Does chain-of-thought reasoning reveal genuine inference or pattern matching?). Explicit connectives are essentially the discourse-level version of that crutch: a learned form the model can echo. Tasks that demand integrating an inferential pattern across distributed spans, such as argument scheme classification, plateau at F1 0.55–0.65, while the same systems exceed 0.80 on tasks with local surface features such as component tagging and stance (see: Why does argument scheme classification stumble where other NLP tasks succeed?).

The thing you might not have expected to learn: this whole pattern is arguably what language models *are*, not a bug to be patched. One line of thinking holds that LLMs operationalize Saussure's *langue*: they compress the relational structure of text without any external referent or grounding (see: Can language models learn meaning without engaging the world?). An explicit connective lives inside that relational system; an implicit relation points outside it, to inference about a world the model never touches. Seen that way, the explicit/implicit gap isn't a quirk of one benchmark; it's the visible seam between pattern completion and the structural understanding these systems were never built to have.


Sources (9 notes)

Why does ChatGPT fail at implicit discourse relations?

ChatGPT performs well on explicit discourse relations with connectives but achieves only 24.54% accuracy on implicit relations without them. This asymmetry reveals that LLMs rely on surface signals rather than inferring meaning from semantic content.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Where exactly do language models fail at structural language tasks?

Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Can language models adapt implicature to conversational context?

ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Why does argument scheme classification stumble where other NLP tasks succeed?

Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
