Dialogue State Tracking with a Language Model using Schema-Driven Prompting
Task-oriented conversational systems often use dialogue state tracking to represent the user’s intentions, which involves filling in values of pre-defined slots. Many approaches have been proposed, often using task-specific architectures with special-purpose classifiers. Recently, good results have been obtained using more general architectures based on pretrained language models. Here, we introduce a new variation of the language modeling approach that uses schema-driven prompting to provide task-aware history encoding used for both categorical and non-categorical slots. We further improve performance by augmenting the prompting with schema descriptions, a naturally occurring source of in-domain knowledge.
In task-oriented dialogues, systems communicate with users through natural language to accomplish a wide range of tasks, such as food ordering, tech support, restaurant/hotel/travel booking, etc. The backbone module of a typical system is dialogue state tracking (DST), where the user goal is inferred from the dialogue history (Henderson et al., 2014; Shah et al., 2018; Budzianowski et al., 2018). User goals are represented in terms of values of pre-defined slots associated with a schema determined by the information needed to execute task-specific queries to the backend. In other words, user goals are extracted progressively via slot filling based on the schema throughout the conversation. In this paper, we focus on multi-domain DST where the dialogue state is encoded as a list of triplets in the form of (domain, slot, value), e.g. (“restaurant”, “area”, “centre”).
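As a concrete illustration of this representation, the following minimal Python sketch (our own, not code from the paper) accumulates (domain, slot, value) triplets turn by turn, overwriting a slot when the user revises it; the helper name and update logic are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (domain, slot, value)

def update_state(state: List[Triplet], turn_updates: List[Triplet]) -> List[Triplet]:
    """Merge slot values mentioned in the latest turn into the dialogue state."""
    merged: Dict[Tuple[str, str], str] = {(d, s): v for d, s, v in state}
    for domain, slot, value in turn_updates:
        merged[(domain, slot)] = value  # a later mention overwrites an earlier value
    return [(d, s, v) for (d, s), v in merged.items()]

state: List[Triplet] = []
state = update_state(state, [("restaurant", "area", "centre")])
state = update_state(state, [("restaurant", "food", "italian"),
                             ("restaurant", "area", "north")])  # user changes their mind
print(state)
# [('restaurant', 'area', 'north'), ('restaurant', 'food', 'italian')]
```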
There are two broad paradigms of DST models, classification-based and generation-based, which differ mainly in how the slot value is inferred. In classification-based models (Ye et al., 2021; Chen et al., 2020), the prediction of a slot value is restricted to a fixed set for each slot, and non-categorical slots are constrained to values observed in the training data. In contrast, generation-based models (Wu et al., 2019; Kim et al., 2020) decode slot values sequentially (token by token) based on the dialogue context, with the potential of recovering unseen values. Recently, generation-based DST built on large-scale pretrained neural language models (LMs) has achieved strong results without relying on domain-specific modules. Among these, autoregressive models (Peng et al., 2020a; Hosseini-Asl et al., 2020) use a uni-directional encoder, whereas sequence-to-sequence models (Lin et al., 2020a; Heck et al., 2020) represent the dialogue context with a bi-directional encoder.
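The contrast between the two paradigms can be made concrete with a schematic sketch (our own simplification, not code from any of the cited systems): a classifier scores a fixed candidate list for each slot, whereas a generation-based model decodes the value string from the dialogue context with a pretrained LM; all class, function, and variable names below are illustrative.

```python
import torch
import torch.nn as nn

# Classification-based: each slot has a fixed candidate list, so non-categorical
# slots can only take values that appeared in the training data.
class SlotClassifier(nn.Module):
    def __init__(self, hidden_size: int, candidate_values: list):
        super().__init__()
        self.candidate_values = candidate_values
        self.scorer = nn.Linear(hidden_size, len(candidate_values))

    def forward(self, context_encoding: torch.Tensor) -> str:
        logits = self.scorer(context_encoding)  # shape: [num_candidates]
        return self.candidate_values[int(logits.argmax())]

# Generation-based: the value is decoded token by token conditioned on the
# dialogue context, so unseen values can in principle be produced.
def generate_value(model, tokenizer, dialogue_context: str) -> str:
    inputs = tokenizer(dialogue_context, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=8)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```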
In this study, we follow a generation-based DST approach using a pretrained sequence-to-sequence model, but with a new strategy of adding task-specific prompts as input to the sequence-to-sequence DST model, inspired by prompt-based fine-tuning (Radford et al., 2019; Brown et al., 2020a). Specifically, instead of generating domain and slot symbols in the decoder, we concatenate the dialogue context with domain and slot prompts as input to the encoder, where the prompts are taken directly from the schema. We hypothesize that jointly encoding the dialogue context and schema-specific textual information can further benefit a sequence-to-sequence DST model.
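A minimal sketch of this input construction is shown below, assuming T5 as the sequence-to-sequence backbone; the exact separators, prompt wording, and slot description are illustrative assumptions, not the paper's specification.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

dialogue_history = (
    "user: i need a cheap restaurant in the centre "
    "system: what kind of food would you like?"
)
domain, slot = "restaurant", "area"
slot_description = "the area of the city where the restaurant is located"

# Concatenate the dialogue context with the schema-derived prompt; the decoder
# then only needs to generate the slot value, not the domain/slot symbols.
encoder_input = (
    f"{dialogue_history} domain: {domain} slot: {slot} "
    f"description: {slot_description}"
)

inputs = tokenizer(encoder_input, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```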