Conversational AI Systems

Do persona consistency metrics actually measure dialogue quality?

Personalized dialogue systems can achieve high persona consistency scores by simply restating character descriptions, ignoring conversational relevance. Does optimizing for persona fidelity necessarily harm the conversational coherence users actually care about?

Note · 2026-02-23 · sourced from Personalization

Personalized dialogue generation faces a persistent dual optimization problem: persona consistency and discourse coherence pull in different directions, and most methods sacrifice one for the other.

The measurement trap is revealing. Methods that achieve the highest personalization scores (e.g., PAA) do so by "frequently generating sentences that are exact restatements of the persona description, often ignoring the relevance to the query." High persona adherence metrics can be achieved trivially through description copying — which looks like success on the persona dimension while failing on the coherence dimension. This is not a training failure but a measurement artifact that rewards surface-level persona adherence.
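The gaming dynamic can be made concrete with a toy metric. The sketch below (illustrative only, not the metric used in the paper) scores persona consistency as unigram F1 overlap between a response and the persona description; a verbatim restatement that ignores the query scores perfectly, while a contextually grounded reply is penalized.

```python
# Illustrative sketch: a naive overlap-based persona-consistency score
# is maximized by copying the persona description verbatim.

def token_f1(response: str, persona: str) -> float:
    """Unigram F1 between a response and the persona description."""
    r, p = response.lower().split(), persona.lower().split()
    overlap = sum(min(r.count(t), p.count(t)) for t in set(r))
    if overlap == 0:
        return 0.0
    precision = overlap / len(r)
    recall = overlap / len(p)
    return 2 * precision * recall / (precision + recall)

persona = "i love hiking in the mountains"
query = "what did you do last weekend?"

# Exact restatement of the persona, ignoring the query entirely.
copied = "i love hiking in the mountains"
# A response that actually answers the query while staying in persona.
grounded = "i went hiking last weekend and the mountains were beautiful"

assert token_f1(copied, persona) == 1.0   # trivially perfect persona score
assert token_f1(grounded, persona) < 1.0  # penalized despite being relevant
```

Any metric of this shape rewards the copying strategy, which is exactly the failure mode observed in methods like PAA: the persona dimension looks optimal while the coherence dimension collapses.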

The coherence side has two distinct components:

Local coherence — logical connections between adjacent sentences, ensuring they relate to each other and form a coherent sequence. This is sentence-to-sentence reasoning.

Global coherence — higher-level relationships across the entire dialogue, maintaining topic consistency and effectively conveying meaning throughout an interaction. Poor global coherence impairs understanding of the discourse as a cohesive whole.
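The two components can be sketched operationally. The toy code below (an assumption-laden stand-in, using bag-of-words cosine similarity in place of the Sentence-BERT embeddings the paper uses) scores local coherence as mean similarity between adjacent utterances, and global coherence as mean similarity of each utterance to the whole-dialogue centroid.

```python
# Minimal sketch of the two coherence components. Bag-of-words cosine
# stands in for real sentence embeddings; the definitions below are
# illustrative, not the paper's evaluation metrics.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def embed(utterance: str) -> Counter:
    """Toy 'embedding': a bag of lowercased tokens."""
    return Counter(utterance.lower().split())

def local_coherence(dialogue: list[str]) -> float:
    """Mean similarity between adjacent utterances (sentence-to-sentence)."""
    sims = [cosine(embed(a), embed(b)) for a, b in zip(dialogue, dialogue[1:])]
    return sum(sims) / len(sims)

def global_coherence(dialogue: list[str]) -> float:
    """Mean similarity of each utterance to the whole-dialogue centroid."""
    centroid = Counter()
    for u in dialogue:
        centroid += embed(u)
    return sum(cosine(embed(u), centroid) for u in dialogue) / len(dialogue)
```

A dialogue that stays on topic scores high on both; a dialogue whose turns each make sense in isolation but drift between unrelated topics scores near zero locally, which is the distinction the two components are meant to separate.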

MUDI addresses the trade-off by incorporating discourse relations directly into the generation architecture. Using 16 discourse relation types from the STAC annotation scheme plus a topic-shift relation, an LLM (LLaMA-3-70B) annotates coherence relations between utterance pairs. A graph encoder (DialogueGAT) captures these interactive relationships, with Sentence-BERT initializing node features for sentence-level semantics. The key architectural additions are order information and turn information integrated via attention mechanisms.
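The data structure MUDI builds can be sketched structurally. In the snippet below, the relation names, feature fields, and alternating-speaker assumption are all illustrative; the real system uses the 16 STAC relation types plus topic-shift, LLM-produced annotations, Sentence-BERT node features, and a DialogueGAT encoder rather than this plain adjacency structure.

```python
# Structural sketch of a discourse-annotated dialogue graph: utterance
# nodes carry order and turn information; edges carry discourse relations.
from dataclasses import dataclass, field

# A few STAC-style relation names plus the added topic-shift (illustrative
# subset, not the full 16-relation scheme).
RELATIONS = {"question_answer_pair", "elaboration", "acknowledgement",
             "contrast", "topic_shift"}

@dataclass
class UtteranceNode:
    text: str
    order: int    # position in the dialogue (order information)
    turn: int     # speaker turn index (turn information)
    embedding: list[float] = field(default_factory=list)  # e.g. Sentence-BERT

@dataclass
class DiscourseEdge:
    src: int      # index of source utterance
    dst: int      # index of target utterance
    relation: str # one of RELATIONS, annotated by an LLM in MUDI

def build_graph(utterances, annotations):
    """annotations: (src, dst, relation) triples from the LLM annotator.
    Assumes two speakers alternating turns."""
    nodes = [UtteranceNode(text=u, order=i, turn=i % 2)
             for i, u in enumerate(utterances)]
    edges = [DiscourseEdge(s, d, r) for s, d, r in annotations
             if r in RELATIONS]  # drop anything outside the relation scheme
    return nodes, edges
```

The point of the structure is that a graph encoder attending over these nodes sees not just what was said, but how each utterance relates to its neighbors and where it sits in the dialogue.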

The broader principle: persona fidelity and contextual coherence must be jointly optimized, not separately measured. As argued in "Why does supervised learning fail to enforce persona consistency?", the training method (RL for consistency) and the generation architecture (discourse-aware generation for coherence) address different dimensions of the same problem. Neither alone is sufficient.
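A hedged sketch of what "jointly optimized" means in practice: score each response on one combined objective rather than reporting persona and coherence separately. The weighting and the component scores below are hypothetical, not from the paper.

```python
# Hypothetical combined objective: alpha trades persona fidelity
# against coherence instead of measuring them in isolation.
def joint_score(persona_score: float, local_coh: float,
                global_coh: float, alpha: float = 0.5) -> float:
    coherence = 0.5 * (local_coh + global_coh)
    return alpha * persona_score + (1 - alpha) * coherence

# Verbatim persona restatement: perfect persona score, poor coherence.
restated = joint_score(persona_score=1.0, local_coh=0.2, global_coh=0.2)
# Contextually grounded reply: lower persona score, high coherence.
grounded = joint_score(persona_score=0.7, local_coh=0.9, global_coh=0.8)
assert grounded > restated  # the joint objective prefers the grounded reply
```

Under a separate-measurement regime the restated response wins on the persona leaderboard; under the joint objective it loses, which is the measurement change the note argues for.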

This connects to the three-failure-modes analysis. Building on "Why do static persona descriptions produce repetitive dialogue?", the persona-restatement failure identified by MUDI is a fourth failure mode: not just repetitiveness, shallowness, and contradiction, but contextual irrelevance, i.e. generating persona-consistent but conversationally inappropriate responses.

