What semantic failures break dialogue coherence most realistically?
Can we distinguish distinct types of incoherence by manipulating semantic structure rather than surface text? This matters because text-level evaluations miss the semantic failures that actually occur in dialogue systems.
Evaluating dialogue coherence has traditionally relied on text-level manipulations: shuffling turn order, or replacing utterances with ones drawn from unrelated conversations. DEAM demonstrates these are insufficient: classifiers trained on text-level negatives cannot detect AMR-based semantic negatives, while classifiers trained on AMR-based negatives can detect text-level ones. Semantic-level incoherence is both harder to detect and more realistic.
The four failure modes correspond to distinct ways AI dialogue actually breaks down:
1. Contradiction — directly or indirectly contradicting previous utterances. Generated by inserting negative polarity into the AMR graph or replacing concepts with antonyms from ConceptNet (see the graph-manipulation sketch after this list). A common issue in deployed dialogue systems.
2. Coreference inconsistency — incorrect references to previously mentioned entities. Pronouns play an essential role, since coherence is preserved through correct reference chains. Generated by manipulating argument nodes in AMR graphs.
3. Irrelevancy — utterances unrelated to the dialogue context. The simplest form (random substitution) was already captured by prior work, but AMR-based irrelevancy creates more subtle, natural-sounding deviations.
4. Decreased engagement — a speaker evading questions or failing to provide detail. Prior work ignored this failure mode entirely. In coherent conversations, speakers exchange detailed opinions, ask and answer questions. When one interlocutor becomes evasive or vague, coherence degrades even if individual utterances are grammatically and semantically acceptable.
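To make the first two manipulations concrete, here is a minimal sketch using the penman library (`pip install penman`). The toy graph and the specific edits are illustrative assumptions, not DEAM's exact rules, which draw antonyms from ConceptNet and select manipulation sites heuristically.

```python
# Minimal sketch of DEAM-style AMR manipulations with the penman library.
# The toy graph and edit choices are illustrative, not DEAM's exact rules.
import penman

# AMR for "The boy wants to go."
graph = penman.decode("(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))")

# 1. Contradiction: negate the main predicate by attaching :polarity -
#    -> regenerates as "The boy does not want to go."
contradiction = penman.Graph(
    graph.triples + [("w", ":polarity", "-")], top=graph.top
)
print(penman.encode(contradiction))

# 2. Coreference inconsistency: break the re-entrant reference chain.
#    go-02's :ARG0 no longer points back to the boy but to a new,
#    never-introduced entity -> "The boy wants her to go."
triples = [t for t in graph.triples if t != ("g", ":ARG0", "b")]
triples += [("g", ":ARG0", "s"), ("s", ":instance", "she")]
coref_broken = penman.Graph(triples, top=graph.top)
print(penman.encode(coref_broken))
```

Because the edits happen on graph triples rather than surface strings, the regenerated text stays fluent; only the meaning is broken.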
The fourth failure mode is the most novel: decreased engagement is not a semantic error but a pragmatic one. The content is acceptable; the communicative effort is insufficient. This connects directly to the grounding problem. As Why do language models sound fluent without grounding? argues, LLMs may produce responses that are semantically appropriate but pragmatically disengaged: answering without engaging.
The AMR approach works because Abstract Meaning Representation captures semantic structure (named entities, negations, coreferences, modalities) at a level deeper than surface syntax, allowing manipulations that produce natural-sounding but semantically incoherent text. The AMR-to-Text step ensures the negative examples sound realistic rather than obviously broken.
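A hedged sketch of that parse-manipulate-regenerate loop, assuming the amrlib library with its pretrained sentence-to-graph and graph-to-sentence models installed. This is a stand-in for DEAM's actual parser and generator, which may differ; the string-replace edit is a crude placeholder for a real graph manipulation.

```python
# Sketch of the AMR round-trip behind DEAM's negative-example generation,
# assuming amrlib (pip install amrlib) plus its downloadable pretrained
# models. DEAM's own parser/generator choices may differ.
import amrlib

stog = amrlib.load_stog_model()  # sentence-to-graph (text -> AMR)
gtos = amrlib.load_gtos_model()  # graph-to-sentence (AMR -> text)

turn = "I loved the ending of that movie."
(graph,) = stog.parse_sents([turn])

# Crude stand-in for a real graph edit: flip the polarity of the main
# predicate (a penman-based edit, as sketched above, is more robust).
manipulated = graph.replace("love-01", "love-01 :polarity -", 1)

# The AMR-to-Text step makes the incoherent negative sound fluent.
sentences, _ = gtos.generate([manipulated])
print(sentences[0])  # e.g. "I didn't love the ending of that movie."
```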
These four failure modes map onto the three components of What three layers must discourse systems actually track?: contradiction and coreference inconsistency involve the attentional component (tracking which entities are currently salient), irrelevancy involves the intentional component (whether an utterance serves the discourse purpose), and decreased engagement spans all three, since a speaker who stops engaging withdraws from the linguistic, intentional, and attentional structures simultaneously.
Related concepts in this collection
- How do readers track segments, purposes, and salience together?
  Can discourse processing actually happen in parallel rather than sequentially? This matters because understanding how readers coordinate multiple layers of meaning at once reveals where AI systems break down in comprehension.
  DEAM provides four specific failure modes within the coherence tracking framework.
- Why do language models sound fluent without grounding?
  Explores whether LLM fluency masks the absence of communicative work: the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?
  Decreased engagement is a specific form of the grounding gap: technically responding but not communicatively working.
- Why does ChatGPT fail at implicit discourse relations?
  ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
  AMR-based incoherence operates at the implicit level where LLMs fail.
- What three layers must discourse systems actually track?
  Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.
  DEAM's four failure modes map onto Grosz & Sidner's three components: contradiction and coreference inconsistency involve the attentional component (tracking salient entities), irrelevancy involves the intentional component (purpose alignment), and decreased engagement spans all three.
- Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?
  Does encoding linguistic complexity, emotion, topics, and relevance as parallel temporal streams expose emergent patterns that traditional statistical analysis misses? This matters because conversation success may depend on interactions between dimensions, not individual features alone.
  DEAM's four failure modes would produce distinct signatures in Conversational DNA's multi-dimensional tracking: contradiction as semantic volatility, coreference as referential discontinuity, engagement as temporal trajectory decline.
- Do language models segment events like human consensus does?
  Can GPT-3 identify event boundaries in narrative text the way humans do? This matters because it could reveal whether language models and human cognition share similar predictive mechanisms for understanding continuous experience.
  Event segmentation provides temporal scaffolding for coherence: correctly segmented events make contradictions and coreference inconsistencies detectable within and across event boundaries.
- What six problems must every conversation solve?
  Schegloff's Conversation Analysis identifies six universal organizational challenges that speakers navigate in all talk-in-interaction. Understanding these helps explain why current AI dialogue systems fall short of human fluency.
  DEAM's failure modes map to specific Schegloff orders: contradiction and coreference signal trouble-handling failures (understanding problems not repaired), decreased engagement is action-formation failure (the speaker stops performing appropriate actions), and irrelevancy is sequence-organization failure (the turn doesn't cohere with the prior one).
- Can conversation structure predict dialogue success better than content?
  Does the geometric shape of how dialogue unfolds (timing, repetition, topic drift) matter as much as what people actually say? This explores whether interactive patterns carry signals that word choice alone misses.
  DEAM's failure modes would produce distinct TRACE geometric signatures: contradiction as distance spikes, coreference as referential drift, engagement as flattened dynamics.
Original note title
dialogue coherence has four semantic-level failure modes distinguishable through AMR manipulation: contradiction, coreference inconsistency, irrelevancy, and decreased engagement