Dialogue Transformers
Conversational AI assistants promise to help users achieve a task through natural language. Interpreting simple instructions like "please turn on the lights" is relatively straightforward, but to handle more complex tasks, these systems must be able to engage in multi-turn conversations.
The goal of this paper is to show that the transformer architecture [1] is more suitable for modeling multi-turn conversations than the commonly used recurrent models. To compare the basic mechanisms at the heart of the sequence encoding, we intentionally choose simple architectures. The proposed TED architecture should be thought of as a candidate building block for use in developing state-of-the-art architectures in various dialogue tasks.
Not every utterance in a conversation has to be a response to the most recent utterance by the other party. Grosz and Sidner [3] consider conversations as an interleaved set of discourse segments, where a discourse segment (or topic) is a set of utterances that directly respond to each other. These sequences of turns may not directly follow one another in the conversation. An intuitive example of this is the need for sub-dialogues in task-oriented dialogue systems. Consider this conversation:
BOT: Your total is $15.50 - shall I charge the card you used last time?
USER: Do I still have credit on my account from that refund I got?
BOT: Yes, your account is $10 in credit.
USER: Ok, great.
BOT: Shall I place the order?
USER: Yes.
BOT: Done. You should have your items tomorrow.
In the 1980s, Grosz and Sidner [3] argued for representing dialogue history as a stack of topics, and later the RavenClaw [4] dialogue system implemented a dialogue stack for the specific purpose of handling sub-dialogues. While a stack naturally allows for sub-dialogues to be handled and concluded, the strict structure of a stack is also limiting. The authors of RavenClaw argue for explicitly tracking topics to enable the contextual interpretation of the user intents. However, once a topic has been popped from the dialogue stack, it is no longer available to provide this context. In the example above, the user might follow up with a further question such as "so that used up my credit, right?". If the topic of refund credits has been popped from the stack, it can no longer help clarify what the user wants to know. Since there is in principle no restriction on how humans revisit and interleave topics in a conversation, we are interested in a more flexible structure than a stack.
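The limitation described above can be illustrated with a minimal sketch. This is a toy model, not RavenClaw's implementation: the `DialogueStack` class, its `can_resolve` method, and the topic labels `place_order` and `refund_credit` are hypothetical names chosen for this example. It assumes that a follow-up can only be interpreted against a topic that is still on the stack.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueStack:
    """Toy topic stack (hypothetical): only topics still on it supply context."""

    topics: list = field(default_factory=list)

    def push(self, topic: str) -> None:
        self.topics.append(topic)

    def pop(self) -> str:
        return self.topics.pop()

    def can_resolve(self, topic: str) -> bool:
        # Assumption: a follow-up needs its topic to still be on the stack.
        return topic in self.topics


stack = DialogueStack()
stack.push("place_order")    # BOT: Your total is $15.50 ...
stack.push("refund_credit")  # USER: Do I still have credit ...?
stack.pop()                  # USER: Ok, great. (sub-dialogue concluded)
stack.pop()                  # BOT: Done. (order concluded)

# USER: so that used up my credit, right?
follow_up_topic = "refund_credit"
stack_has_context = stack.can_resolve(follow_up_topic)

# A flat record of every topic visited still contains the earlier topic.
history = ["place_order", "refund_credit", "place_order"]
history_has_context = follow_up_topic in history

print(stack_has_context, history_has_context)  # prints: False True
```

Under these assumptions, the stack has discarded exactly the context the follow-up question depends on, whereas a structure that retains the full history of turns can still refer back to it.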