Do Response Selection Models Really Know What's Next? Utterance Manipulation Strategies for Multi-turn Response Selection
In this paper, we study the task of selecting the optimal response given a user and system utterance history in retrieval-based multi-turn dialog systems. Recently, pre-trained language models (e.g., BERT, RoBERTa, and ELECTRA) have shown significant improvements on various natural language processing tasks. This and similar response selection tasks can also be solved with such language models by formulating them as dialog–response binary classification tasks. Although existing works using this approach have obtained state-of-the-art results, we observe that language models trained in this manner tend to make predictions based on the topical relatedness of the history and the candidates, ignoring the sequential nature of multi-turn dialog. This suggests that the response selection task alone is insufficient for learning temporal dependencies between utterances. To this end, we propose utterance manipulation strategies (UMS) to address this problem.
Utterance Manipulation Strategies
Figure 2 gives an overview of our proposed method, utterance manipulation strategies. We propose a multi-task learning framework consisting of three highly effective auxiliary tasks for multi-turn response selection: 1) utterance insertion, 2) utterance deletion, and 3) utterance search. These tasks are jointly trained with the response selection model during fine-tuning. To train the auxiliary tasks, we add new special tokens, [INS], [DEL], and [SRCH], for the utterance insertion, deletion, and search tasks, respectively. We describe how the model is trained with these special tokens in the following sections.
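As a minimal illustration of the special-token setup, the sketch below extends a token-to-id vocabulary with the three task tokens. The function name and the plain-dict representation are our own for illustration; in practice one would register the tokens with the pre-trained model's tokenizer (e.g., HuggingFace's `tokenizer.add_special_tokens` followed by `model.resize_token_embeddings`).

```python
# Task-specific special tokens used by the three auxiliary tasks.
SPECIAL_TOKENS = ["[INS]", "[DEL]", "[SRCH]"]

def extend_vocab(vocab, special_tokens=SPECIAL_TOKENS):
    """Append the UMS special tokens to an existing token->id vocabulary.

    Illustrative sketch only: assumes `vocab` maps tokens to contiguous ids
    starting at 0, so each new token gets the next free id.
    """
    vocab = dict(vocab)  # do not mutate the caller's vocabulary
    for tok in special_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab
```

Because each auxiliary task has its own token, the shared encoder can be trained on all four objectives (response selection plus the three manipulation tasks) with the task identified purely by the input format.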
Utterance Insertion Despite the huge success of BERT, it has limitations in understanding discourse-level semantic structure, since NSP, one of BERT's pre-training objectives, mainly learns topic shift rather than sentence ordering (Lan et al. 2020). In multi-turn response selection, the model needs not only to distinguish utterances with different semantic meanings but also to discriminate whether utterances are consecutive even when they are semantically related. We propose utterance insertion to resolve these issues.
We first extract k consecutive utterances from the original dialog context, then randomly select one utterance and remove it. To train the model to identify where the removed utterance should be re-inserted, [INS] tokens are placed before each remaining utterance and after the last one, giving k candidate insertion positions.
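The construction above can be sketched as follows. This is an illustrative data-building routine under our own naming and return format (a flat token list, the removed utterance, and the index of the correct [INS] slot); the paper does not prescribe this exact interface.

```python
import random

INS = "[INS]"

def make_insertion_example(dialog, k=4, rng=random):
    """Build one utterance-insertion training example (illustrative sketch).

    From `dialog` (a list of utterance strings), take k consecutive
    utterances, remove one at random, and interleave [INS] tokens so that
    there is one slot before each remaining utterance and one after the
    last. The label is the slot where the removed utterance belongs.
    """
    start = rng.randrange(len(dialog) - k + 1)
    window = dialog[start:start + k]
    target = rng.randrange(k)        # position of the utterance to remove
    removed = window.pop(target)
    # [INS] before each remaining utterance, plus one trailing [INS]:
    # k - 1 utterances interleaved with k insertion slots in total.
    context = []
    for utt in window:
        context.extend([INS, utt])
    context.append(INS)
    # Slot i sits before the i-th remaining utterance (slot k-1 is final),
    # so the correct slot index equals the removed utterance's position.
    return context, removed, target
```

At training time, the model scores each [INS] token's representation and is supervised to pick the slot that restores the original utterance order.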