DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations

Paper · arXiv 2203.09711 · Published March 18, 2022

These models take a contrastive learning approach: they build binary classifiers to differentiate positive (coherent) conversations from negative (incoherent) ones. The classifiers are usually trained on datasets constructed by using human-human conversations as positive examples and applying text-level heuristic manipulations to generate incoherent conversations. These text-level manipulations directly change the structure of the conversation, for example by shuffling the order of utterances or replacing random utterances with utterances from external conversations (Vakulenko et al., 2018; Mesgar et al., 2020; Zhang et al., 2021), as shown in the third dialogue of Figure 1.

Automatic evaluation of open-domain dialogue systems has a multifaceted nature with many fine-grained quality aspects (Mehri and Eskénazi, 2020). Turn-level aspects assess the quality of the system's utterance given a dialogue context from different perspectives, including appropriateness, relevance, and engagement (Lowe et al., 2017; Tao et al., 2018; Ghazarian et al., 2020). Conversation-level facets such as coherence, diversity, and informativeness, in contrast, take the whole dialogue flow into account (Vakulenko et al., 2018; Zhang et al., 2021; Mehri and Eskénazi, 2020).

We propose four manipulation strategies to represent four common incoherence sources of state-of-the-art dialogue models: contradiction, coreference inconsistency, irrelevancy, and decreased engagement.

Moreover, DEAM is capable of distinguishing positive examples from negative examples generated by baselines that use text-level manipulations, whereas the opposite is not true: classifiers trained on text-level manipulations cannot detect negative examples generated by DEAM.

Our goal is to build an evaluation metric that measures the conversation-level coherence of dialogues. Following trainable evaluation metrics (Vakulenko et al., 2018), we formulate evaluation as a classification task: we train the evaluator on positive (coherent) and negative (incoherent) conversations and take the predicted probability of the positive class as the coherence score.

As discussed above, the main challenge in building a reliable metric is obtaining negative samples that adequately represent the incoherence issues present in advanced dialogue systems. To this end, we propose to generate negative examples by leveraging AMR-based manipulations. We then build a RoBERTa-based classifier as the evaluation metric by fine-tuning RoBERTa on the automatically generated training data. Figure 2 illustrates an overview of our proposed evaluation method.
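The scoring step described above can be sketched as follows: a RoBERTa sequence classifier, fine-tuned on coherent versus manipulated dialogues, whose positive-class probability serves as the coherence score. This is an illustrative sketch, not the paper's released code; the checkpoint name in the usage comment and the turn separator are assumptions.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer


def coherence_score(dialogue_turns, model, tokenizer):
    """Join the turns with the separator token and return P(coherent),
    the predicted probability of the positive class."""
    text = tokenizer.sep_token.join(dialogue_turns)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()


# Typical usage after fine-tuning on coherent vs. manipulated dialogues
# ("my-deam-checkpoint" is a hypothetical model name):
#   tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
#   model = RobertaForSequenceClassification.from_pretrained("my-deam-checkpoint")
#   score = coherence_score(["Hi!", "Hello, how was your day?"], model, tokenizer)
```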

4.1 Baseline Manipulations

Baseline manipulations can be classified as:

  1. Shuffling-based manipulations: In these manipulations, the order of turns (Vakulenko et al., 2018), the sequence of a speaker's utterances (Mesgar et al., 2020; Vakulenko et al., 2018; Zhang et al., 2021), or the positions of the first and second halves of a conversation (Vakulenko et al., 2018) are swapped.

  2. Insertion-based manipulations: This group of manipulations adds incoherence sources by replacing (Mesgar et al., 2020; Zhang et al., 2021) or inserting (Mesgar et al., 2020) a random utterance from a randomly selected conversation. Each baseline metric fuses multiple manipulations, so we refer to them by their citations, e.g. (Vakulenko et al., 2018) and (Mesgar et al., 2020), in later sections.
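The two families of text-level baseline manipulations above can be illustrated with a small sketch: one function permutes turn order, the other substitutes a random turn with an utterance drawn from an external conversation. Function names are ours, for illustration only.

```python
import random


def shuffle_turns(dialogue, rng=random):
    """Shuffling-based manipulation: permute the order of turns,
    making sure the order actually changes."""
    negative = dialogue[:]
    while negative == dialogue:
        rng.shuffle(negative)
    return negative


def replace_random_turn(dialogue, other_dialogue, rng=random):
    """Insertion-based manipulation: substitute one turn with a turn
    taken from a randomly selected external conversation."""
    negative = dialogue[:]
    negative[rng.randrange(len(negative))] = rng.choice(other_dialogue)
    return negative


dialogue = ["Hi!", "Hello, how was your day?", "Great, I went hiking.", "Sounds fun!"]
other = ["Do you like jazz?", "I prefer rock."]
print(shuffle_turns(dialogue))
print(replace_random_turn(dialogue, other))
```

Both manipulations operate purely on the surface text, which is exactly why, as argued above, the resulting negatives are easy to detect and do not reflect the subtler errors of modern dialogue models.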

4.2 AMR-based Manipulations

AMR was originally proposed by Banarescu et al. (2013) as a semantic representation language that abstracts text away from its surface syntax. AMR graphs can encode abstract-level semantic information such as named entities, negation, questions, coreference, and modality. These capabilities make AMR attractive for many semantics-related NLP tasks such as summarization (Liao et al., 2018) and machine translation (Song et al., 2019). Conversations between two interlocutors contain many semantic details that such graphs can capture. We therefore explore the use of AMR in dialogue system evaluation by manipulating the AMR graphs of coherent conversations, with each manipulation reflecting a specific source of incoherence in dialogue systems.
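For concreteness, AMRs are conventionally written in Penman notation. The classic example from Banarescu et al. (2013) encodes "The boy wants to go"; note how re-using the variable `b` captures the coreference between the wanter and the goer:

```
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
```

Negation is likewise explicit (a `:polarity -` edge), which is what makes graph-level manipulations of meaning straightforward compared to editing raw text.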

4.2.1 Contradiction

One of the common issues that dialogue systems struggle with is directly or indirectly contradicting previous utterances in the dialogue. To replicate this type of error, a contradicted version of a subgraph from the original AMR is copied to other locations. Such negated AMRs can be obtained by directly adding polarity to concepts or by replacing concepts with their antonyms, i.e., concepts that hold Antonym, NotDesires, NotCapableOf, or NotHasProperty relations in ConceptNet (Speer and Havasi, 2012). After the contradictions are added, the AMR-to-text model uses the encoded context to output incoherent yet natural-sounding conversations.
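A minimal sketch of the polarity-based variant of this manipulation on a toy AMR in Penman notation: negating a concept by attaching `:polarity -` (the other strategy described above swaps in a ConceptNet antonym). The regex-based string edit is ours, not the paper's implementation, which operates on parsed graphs.

```python
import re


def negate_concept(amr, concept):
    """Insert ':polarity -' after the first occurrence of `concept`
    in a Penman-notation AMR string."""
    pattern = re.compile(rf"(/ {re.escape(concept)}\b)")
    return pattern.sub(r"\1 :polarity -", amr, count=1)


amr = "(l / like-01 :ARG0 (i / i) :ARG1 (d / dog))"
print(negate_concept(amr, "like-01"))
# (l / like-01 :polarity - :ARG0 (i / i) :ARG1 (d / dog))
```

An AMR-to-text model would then verbalize the negated graph, e.g. "I don't like dogs", contradicting the original utterance while remaining fluent.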

4.2.2 Coreference Inconsistency

The coherence of a conversation relies on correct references to previously mentioned entities and words in the dialogue context, and pronouns play an essential role in this regard. Coreferences in AMRs are represented as arguments (ARG), and all three types of pronouns (subjective, objective, and possessive) are shown in their subjective form.
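A toy sketch of how a coreference can be broken at the graph level: AMRs mark coreference by re-using a variable (a re-entrancy), so redirecting the re-entrant reference to a different variable severs the coreference chain. The string-level edit below is illustrative only, not the paper's code.

```python
def swap_reference(amr, old_var, new_var):
    """Redirect the re-entrant use of `old_var` (e.g. ':ARG1 b)') to
    `new_var`, leaving the variable's definition intact."""
    return amr.replace(f" {old_var})", f" {new_var})")


# "The boy wants the girl to believe him": b is re-used for coreference.
amr = ("(w / want-01 :ARG0 (b / boy) "
       ":ARG1 (b2 / believe-01 :ARG0 (g / girl) :ARG1 b))")
print(swap_reference(amr, "b", "g"))
# (w / want-01 :ARG0 (b / boy) :ARG1 (b2 / believe-01 :ARG0 (g / girl) :ARG1 g))
```

After regeneration, the pronoun in the output text no longer refers to the intended entity (the girl is asked to believe herself rather than the boy), which is exactly the kind of coreference inconsistency this manipulation targets.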

4.2.3 Irrelevancy

Random utterance substitution from other conversations is a simple way to inject incoherence sources into dialogues and has been frequently used in prior work.

4.2.4 Decreased Engagement

In coherent conversations, speakers exchange opinions about different topics by stating detailed information and by asking and answering questions. This coherence fades if one of the interlocutors evades answering questions or stops going into detail. In contrast to previous works that ignore this important aspect, we incorporate this kind of incoherence source into our negative sample generation.
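One way to approximate this manipulation at the graph level is to prune a detail-bearing branch from a turn's AMR, so that the regenerated reply is vaguer and less engaged. The branch-dropping heuristic below is our toy illustration, not the paper's exact procedure.

```python
def drop_branch(amr, role):
    """Remove `role` and its parenthesized subgraph from a
    Penman-notation AMR string, e.g. drop ':time (t / today)'."""
    start = amr.find(role)
    if start == -1:
        return amr
    i = amr.find("(", start)
    depth = 0
    for j in range(i, len(amr)):          # scan for the balanced close paren
        if amr[j] == "(":
            depth += 1
        elif amr[j] == ")":
            depth -= 1
            if depth == 0:
                return amr[:start].rstrip() + amr[j + 1:]
    return amr


# "I go to the park today" -> drop the :time detail, yielding a vaguer turn.
amr = "(g / go-01 :ARG0 (i / i) :ARG1 (p / park) :time (t / today))"
print(drop_branch(amr, ":time"))
# (g / go-01 :ARG0 (i / i) :ARG1 (p / park))
```

Feeding the pruned graph to the AMR-to-text model yields a less informative utterance, mimicking an interlocutor who withholds detail.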