Language Understanding and Pragmatics · Conversational AI Systems

What semantic failures break dialogue coherence most realistically?

Can we distinguish different types of incoherence by manipulating semantic structure rather than surface text? This matters because text-level evaluations miss the semantic failures that actually occur in dialogue systems.

Note · 2026-02-22 · sourced from Conversation Architecture Structure
Related: Where exactly does language competence break down in LLMs? · Why do AI conversations reliably break down after multiple turns? · How should researchers navigate LLM reasoning research?

Evaluating dialogue coherence has long relied on text-level manipulations: shuffling turn order, or replacing utterances with ones drawn from other conversations. DEAM demonstrates these are insufficient. Classifiers trained on text-level negatives cannot detect AMR-based semantic negatives, but classifiers trained on AMR-based negatives can detect text-level ones. Semantic-level incoherence is both harder to detect and more realistic.
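A minimal sketch of that cross-manipulation test. A TF-IDF + logistic-regression pipeline stands in for the fine-tuned transformer classifier DEAM actually uses, and the tiny dialogues are placeholders, not the paper's data; the point is the protocol, not the numbers:

```python
# Sketch: train a coherence classifier on one family of negatives,
# then test it on the other family.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

coherent = [
    "A: How was the movie? B: Great, we loved the twist at the end.",
    "A: Are you coming tonight? B: Yes, I'll be there around eight.",
]
# Text-level negatives: utterances swapped in from unrelated dialogues.
text_negs = [
    "A: How was the movie? B: My dentist appointment is on Tuesday.",
    "A: Are you coming tonight? B: The recipe needs two cups of flour.",
]
# AMR-level negatives: same dialogue, semantically manipulated, re-verbalized.
amr_negs = [
    "A: How was the movie? B: Great, we hated the twist at the end.",
    "A: Are you coming tonight? B: Yes, she'll be there around eight.",
]

def train(pos, neg):
    """Fit a binary coherent/incoherent classifier on the given negatives."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(pos + neg, [1] * len(pos) + [0] * len(neg))
    return clf

# DEAM's finding, as a protocol: the first score is low, the second high.
clf_text = train(coherent, text_negs)
print("text-trained, AMR-tested:", clf_text.score(amr_negs, [0] * len(amr_negs)))
clf_amr = train(coherent, amr_negs)
print("AMR-trained, text-tested:", clf_amr.score(text_negs, [0] * len(text_negs)))
```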

The four failure modes map to distinct AI dialogue failures (a sketch of the corresponding graph manipulations follows the list):

1. Contradiction: directly or indirectly contradicting previous utterances. Generated by adding a polarity (negation) attribute to the AMR graph, or by replacing concepts with their ConceptNet antonyms. A common issue in deployed dialogue systems.

2. Coreference inconsistency: incorrect references to previously mentioned entities. Pronouns play an essential role, since coherence is preserved through correct reference chains. Generated by manipulating argument nodes in the AMR graph so that references point to the wrong entities.

3. Irrelevancy: utterances unrelated to the dialogue context. The simplest form, random utterance substitution, was already captured by prior work; AMR-based irrelevancy creates more subtle, natural-sounding deviations.

4. Decreased engagement: a speaker evading questions or failing to provide detail. Prior work ignored this failure mode entirely. In coherent conversations, speakers exchange detailed opinions and ask and answer questions; when one interlocutor becomes evasive or vague, coherence degrades even if each individual utterance is grammatically and semantically acceptable.
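As referenced above, here is a minimal sketch of what these graph-level manipulations look like, using the penman library on a toy AMR. The graph, the helper names, and the specific edits are illustrative simplifications, not DEAM's implementation, which operates on full dialogue AMRs and uses ConceptNet for antonym lookup:

```python
# pip install penman
import penman
from penman import Graph

# Toy AMR for "The boy told the girl that he likes her."
amr = """
(t / tell-01
   :ARG0 (b / boy)
   :ARG2 (g / girl)
   :ARG1 (l / like-01
            :ARG0 b
            :ARG1 g))"""

def add_contradiction(graph: Graph) -> Graph:
    """Failure mode 1: add negative polarity to the embedded clause
    ("... that he does NOT like her")."""
    return Graph(graph.triples + [('l', ':polarity', '-')], top=graph.top)

def break_coreference(graph: Graph) -> Graph:
    """Failure mode 2: redirect a re-entrant argument edge so the
    reference chain breaks ("... that SHE likes her")."""
    triples = [('l', ':ARG0', 'g') if trip == ('l', ':ARG0', 'b') else trip
               for trip in graph.triples]
    return Graph(triples, top=graph.top)

def decrease_engagement(graph: Graph) -> Graph:
    """Failure mode 4: prune the content subtree, leaving an
    under-informative utterance ("The boy told the girl.")."""
    triples = [(s, r, t) for s, r, t in graph.triples
               if s != 'l' and t != 'l']
    return Graph(triples, top=graph.top)

# Failure mode 3 (irrelevancy) would splice in a subgraph from an
# unrelated dialogue's AMR; it needs a second graph and is omitted here.
g = penman.decode(amr)
for manipulate in (add_contradiction, break_coreference, decrease_engagement):
    print(penman.encode(manipulate(g)))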

The fourth failure mode is the most novel: decreased engagement is not a semantic error but a pragmatic one. The content is acceptable; the communicative effort is insufficient. This connects directly to the grounding problem. As "Why do language models sound fluent without grounding?" explores, LLMs may produce responses that are semantically appropriate but pragmatically disengaged, answering without engaging.

The AMR approach works because Abstract Meaning Representation captures semantic structure (named entities, negations, coreferences, modalities) at a level deeper than surface syntax, allowing manipulations that produce natural-sounding but semantically incoherent text. The AMR-to-Text step ensures the negative examples sound realistic rather than obviously broken.
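The full parse → manipulate → regenerate loop can be sketched with the amrlib package. This is my assumption about tooling, not DEAM's confirmed stack, and amrlib requires separately downloaded pretrained parse and generate models:

```python
# pip install amrlib  (plus a downloaded parse model and generate model)
import amrlib

stog = amrlib.load_stog_model()  # sentence-to-graph: AMR parsing
gtos = amrlib.load_gtos_model()  # graph-to-sentence: AMR-to-Text

turn = "I really enjoyed the movie we saw last night."
graph = stog.parse_sents([turn])[0]

# ... apply a semantic manipulation here, e.g. the polarity flip
# sketched earlier, operating on the penman-decoded graph ...

sentences, _ = gtos.generate([graph])
print(sentences[0])  # fluent re-verbalization of the (manipulated) graph
```

The regeneration step is what makes the negatives hard: the surface text stays fluent, so a classifier cannot rely on disfluency cues and must track the semantics.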

These four failure modes map onto the three layers of discourse structure (see "What three layers must discourse systems actually track?"): contradiction and coreference inconsistency involve the attentional component (tracking which entities are currently salient), irrelevancy involves the intentional component (whether an utterance serves the discourse purpose), and decreased engagement spans all three, since a speaker who stops engaging is withdrawing from the linguistic, intentional, and attentional structure simultaneously.


Source: Conversation Architecture Structure

Dialogue coherence has four semantic-level failure modes distinguishable through AMR manipulation: contradiction, coreference inconsistency, irrelevancy, and decreased engagement.