Task-Oriented Dialogue with In-Context Learning

Paper · arXiv 2402.12234 · Published February 19, 2024

We describe a system for building task-oriented dialogue systems that combines the in-context learning abilities of large language models (LLMs) with the deterministic execution of business logic. LLMs are used to translate between the surface form of the conversation and a domain-specific language (DSL) which is used to progress the business logic. We compare our approach to the intent-based NLU approach predominantly used in industry today. Our experiments show that developing chatbots with our system requires significantly less effort than established approaches, that these chatbots can successfully navigate complex dialogues which are extremely challenging for NLU-based systems, and that our system has desirable properties for scaling task-oriented dialogue systems to a large number of tasks.

The workhorse of industrial task-oriented dialogue systems and assistants is a modular architecture comprising three components: natural language understanding (NLU), dialogue management (DM), and natural language generation (NLG) (Young et al., 2013; Young, 2007).

Utterances spoken or written by end users are translated into dialogue acts, where a dialogue act comprises an intent and a set of entities. For example, an utterance such as “I need a taxi to the station” might be assigned to the intent book taxi and the entity destination with value “station”. This dialogue act representation acts as the interface between the NLU and DM components of the system. The dialogue manager contains the logic to react to a book taxi intent by initiating a taxi booking task, prompting the end user for the time, pick-up location, etc. These fields are typically called slots. As the dialogue progresses, subsequent user messages are also represented as dialogue acts, such as inform(time=3pm). The dialogue manager reacts to this sequence of inputs by executing actions and responding to the end user, either via a rule-based or a model-based dialogue policy. We refer to this as the intent-based NLU approach, and it is used by the major industry platforms for building chat- and voice-based dialogue systems like Rasa (Bocklisch et al., 2017), Dialogflow, Microsoft LUIS, and IBM Watson.
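As an illustration (a minimal sketch with invented names, not any particular platform's API), the intent-based pipeline described above can be modeled as a classifier producing dialogue acts, and a dialogue manager reacting to them:

```python
from dataclasses import dataclass, field

@dataclass
class DialogueAct:
    """Interface between NLU and DM: an intent plus extracted entities."""
    intent: str
    entities: dict[str, str] = field(default_factory=dict)

def nlu(utterance: str) -> DialogueAct:
    """Toy stand-in for a trained intent classifier and entity extractor."""
    if "taxi" in utterance:
        return DialogueAct("book_taxi", {"destination": "station"})
    return DialogueAct("inform", {"time": "3pm"})

def dialogue_manager(act: DialogueAct, slots: dict[str, str]) -> str:
    """Rule-based policy: react to intents, fill slots, prompt for missing ones."""
    if act.intent == "book_taxi":
        slots.update(act.entities)
        return "When do you need the taxi?"
    if act.intent == "inform":
        slots.update(act.entities)
        return f"Booking a taxi to the {slots['destination']} at {slots['time']}."
    return "Sorry, I didn't understand."

slots: dict[str, str] = {}
print(dialogue_manager(nlu("I need a taxi to the station"), slots))  # prompts for time
print(dialogue_manager(nlu("3pm please"), slots))                    # confirms booking
```

The key property to note is that the utterance is collapsed into a single intent label before the dialogue manager ever sees it; the limitations discussed below all follow from that bottleneck.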

A key feature of the intent-based NLU approach is that it poses natural language understanding as a classification task: messages are “understood” by assigning them to a predefined intent. This is a powerful simplifying assumption. In theory, intents provide an interface that fully abstracts the language understanding component from the dialogue manager.

However, working with a fixed list of intents has limitations which become more pronounced as an application matures and scales:

• The taxonomy of intents becomes difficult to remember and reason about when the number of intents reaches several hundred, complicating annotation & feedback loops as well as application debugging.

• Because the dialogue manager is coded to expect specific sequences of intents, making changes to intent definitions and introducing new intents becomes increasingly error-prone, as shifts in classifier outputs introduce regressions.

• Intents are typically defined to map closely to the tasks the assistant can perform, but user utterances often do not correspond directly to a specific task. A developer may create intents like replace card and block card, but end users often describe situations in their own terms (e.g. “I lost my wallet”), which could map to a number of different tasks.

• Messages are assigned to the same intent irrespective of context.

In this work we aim to develop a system for building industrial task-oriented dialogue systems with the following attributes:

Fast iteration: The system should allow for rapid prototyping and testing. The delay between making a change (e.g. modifying task logic) and testing it should ideally be measured in seconds.

Short development time: The system should provide general conversational capabilities out of the box, so that developers can focus on implementing their unique business logic.

Concise representation of business logic: Both developers and subject-matter experts with less technical knowledge should be able to create and modify task logic easily.

Reliable execution of business logic: Business logic of arbitrary complexity should be executed reliably, i.e. we should not rely on a language model to remember and follow a set of steps and branching conditions.

Explainable and debuggable: It should be possible to explain why the system responded in a certain way at any given time.

Scalable to a large number of tasks: It is common for AI assistants in industry to support hundreds of tasks. The system needs to be able to identify the correct task out of hundreds of possibilities. Maintaining the system and adding new tasks should not become more complex as the system increases in size.

Model agnostic: Progress in LLMs is rapid and the approach should allow developers to adopt the latest models without having to re-implement their business logic.

Other approaches have developed new representations and data structures to change how the task of building a dialogue system is formulated.

Cheng et al. (2020) introduced a hierarchical graph structure which represents the ontology of a task. Dialogue state tracking is then framed as a semantic parsing task over this structure.

Andreas et al. (2020) represent the dialogue state as a dataflow graph, where the effect of each turn in a dialogue is to modify this graph. They show that using this representation, a generic seq2seq model can match the performance of neural architectures specifically designed for dialogue state tracking. One similarity between our system and the dataflow approach is that both use the output of a model to generate instructions, and then deterministically execute some logic. However, there are significant differences between the two approaches. Our work also uses a graph representation of computational steps, but only to represent the business logic for a specific task, as designed explicitly by a developer. Additionally, the dataflow approach produces computational steps such as the refer operation, which handles anaphora and entity resolution. In our system, the dialogue manager does not participate in language understanding, and coreference resolution is always handled implicitly, by including the conversation transcript in the LLM prompt and generating commands with the arguments already fully resolved. More generally, our approach uses the conversation transcript as a general-purpose representation of conversation state, using an explicit state representation only to track progress within the logic of a given task.

Recently, researchers have explored using the in-context learning (Brown et al., 2020) abilities of LLMs to have them act fully independently as dialogue systems (Yao et al., 2023), an approach sometimes referred to as LLM “Agents”. This line of work assumes that the business logic required to complete a task is not known a priori and must be inferred on-the-fly as a conversation progresses. This approach explores the possibility of creating fully open-ended assistants which can help with an infinite number of tasks, with the caveat that the developer of the assistant does not control the task logic. Industrial dialogue systems, on the other hand, typically support a known set of tasks whose logic needs to be followed faithfully.

Our architecture comprises three core elements: Business Logic, Dialogue Understanding, and Conversation Repair. When an end user sends a message to an assistant, the following takes place:

  1. The dialogue understanding module interprets the conversation so far and translates the latest user message into a set of commands.

  2. The generated commands are validated and processed by the dialogue manager to update the conversation state.

  3. If the user message requires conversation repair, the corresponding repair patterns are added to the conversation state.

  4. The dialogue manager executes the relevant business logic deterministically, including any repair patterns, and continues executing actions until additional user input is required.
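The turn-handling loop implied by these four steps might be sketched as follows. All names and data structures here are hypothetical simplifications, with a canned stand-in for the LLM-based understanding module:

```python
def handle_user_message(message, state, du, flows):
    """One turn: (1) DU -> commands, (2) validate & apply, (3) repair, (4) execute."""
    state["transcript"].append(("user", message))
    for cmd in du(state["transcript"]):                # step 1: commands from the LLM
        if cmd[0] == "StartFlow" and cmd[1] in flows:  # step 2: validate & apply
            state["pending"] = list(flows[cmd[1]])
        elif cmd[0] in ("SetSlot", "Correct"):         # step 3: a correction is just
            state["slots"][cmd[1]] = cmd[2]            # another command to apply
    responses = []                                     # step 4: deterministic execution
    while state["pending"]:
        kind, value = state["pending"][0]
        if kind == "collect" and value not in state["slots"]:
            responses.append(f"Please provide: {value}")
            break                                      # wait for more user input
        state["pending"].pop(0)
        if kind == "say":
            responses.append(value)
    state["transcript"] += [("bot", r) for r in responses]
    return responses

flows = {"transfer_money": [("collect", "recipient"),
                            ("collect", "amount"),
                            ("say", "Transfer complete.")]}
state = {"transcript": [], "slots": {}, "pending": []}
du = lambda transcript: [("StartFlow", "transfer_money")]  # canned stand-in for the LLM
print(handle_user_message("I want to send money", state, du, flows))
# ['Please provide: recipient']
```

Note that only step 1 involves a model; once commands are produced, steps 2–4 are ordinary deterministic code, which is what makes the business logic reliable and debuggable.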

These two code snippets are all that is required to implement a task; there is no training data required for language understanding. The money transfer flow given here is a minimal example, but flows can include branching logic, function calls, calls to other flows, and more. A more complex example can be found in appendix A. Note that the flow definition does not make any reference to the user side of the conversation. Neither dialogue acts nor commands are represented. Business logic only describes the steps required to complete a task. It does not specify how the end user provides that information. While the flow only specifies the “happy path”, an assistant with this flow can already handle a large number of conversations, including repair cases like corrections, digressions, interruptions, and cancellations. This is described in section 3.3.
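For concreteness, a minimal money-transfer flow of the kind discussed might look like the following sketch (a hypothetical structure, not the exact DSL used by the system). As described above, it contains only the steps of the task, with no dialogue acts, commands, or assumptions about how the user phrases things:

```python
# Hypothetical declarative flow definition (illustrative only).
transfer_money_flow = {
    "description": "Send money to a recipient.",
    "steps": [
        {"collect": "recipient"},            # ask for the slot if not yet filled
        {"collect": "amount"},
        {"action": "execute_transfer"},      # deterministic call into business logic
        {"say": "The transfer is on its way."},
    ],
}

# Slot definitions the flow collects; no NLU training examples are needed.
slot_definitions = {
    "recipient": {"type": "text"},
    "amount": {"type": "number"},
}
```

The flow description doubles as context for the LLM when deciding which task the user wants, which is why no intent examples need to be annotated.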

3.2 Dialogue Understanding

In lieu of an NLU module, our system has a Dialogue Understanding module that leverages the in-context learning abilities of LLMs. Dialogue understanding, framed as a command generation problem, improves upon intent-based NLU in key ways:

• While NLU interprets one message in isolation, DU considers the greater context: the whole running transcript of the conversation as well as the assistant’s business logic. Flow definitions and conversation state provide additional, valuable context for understanding users. This is especially useful for extracting slot values, which often requires coreference resolution.

• While NLU systems output intents and entities representing the semantics of a message, DU outputs a sequence of commands representing the pragmatics of how the user wants to progress the conversation.

• DU requires no additional annotated data beyond the specification of flows.

• While NLU systems assign a user message to one of a fixed list of intents, DU instead is generative, and produces a sequence of commands according to a domain-specific language and available business logic. This representation can express what users are asking with more nuance than a simple classification.
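To make the command representation concrete, the lines an LLM generates (e.g. StartFlow(...) or Clarify(...), mirroring the examples in this paper) could be parsed into structured commands roughly as follows; the parsing code itself is a hypothetical sketch:

```python
import re

# Grammar sketch: CommandName(arg1, arg2, ...), one command per line.
COMMAND_RE = re.compile(r"^(\w+)\((.*)\)$")

def parse_commands(llm_output: str) -> list[tuple[str, list[str]]]:
    """Turn the LLM's generated text into structured (name, args) commands,
    silently skipping any line that does not match the DSL."""
    commands = []
    for line in llm_output.strip().splitlines():
        m = COMMAND_RE.match(line.strip())
        if m:
            name, raw_args = m.groups()
            args = [a.strip() for a in raw_args.split(",")] if raw_args else []
            commands.append((name, args))
    return commands

print(parse_commands("Clarify(freeze card, unfreeze card, cancel card)"))
# [('Clarify', ['freeze card', 'unfreeze card', 'cancel card'])]
```

Because the output space is a small, validated DSL rather than free text, malformed generations can be dropped or repaired before they ever reach the dialogue manager.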

The following example requires a clarification step because the developer has created flows for multiple card-related tasks, and the user’s opening message does not provide enough information to infer which one they want:

  User:      card
  Commands:  Clarify(freeze card, unfreeze card, cancel card)
  Assistant: Would you like to freeze or unfreeze your card, or cancel it?
  User:      cancel
  Commands:  StartFlow(cancel card)

It is worth commenting on the impressive performance of our system on conversations involving corrections, especially in light of previous evidence that LLMs show poor performance on conversation repair (Balaraman et al., 2023). We believe this is because the repair-QA dataset, on which previous studies were based, poses a far more challenging task: it requires an LLM to produce free-form answers from the world knowledge implicit in its parameters. Handling corrections in our system only requires the LLM to reason over the conversation transcript and produce the appropriate command, with the correct slot value typically present verbatim within the prompt.