Are LLMs All You Need for Task-Oriented Dialogue?

Paper · arXiv 2304.06556 · Published April 13, 2023

We show that in explicit belief state tracking, LLMs underperform compared to specialized task-specific models. Nevertheless, when provided with correct slot values, they show some ability to guide the dialogue to a successful conclusion through their generated responses.

We raise the question of to what extent LLMs are capable of handling these applications off-the-shelf, i.e., without fine-tuning. We thus choose to evaluate LLM performance in the task-oriented dialogue (TOD) setting, as it requires precise information handling for communicating with external APIs. Moreover, TOD systems output in-domain information with a predetermined structure, which lends itself well to evaluation thanks to pre-existing annotated data sets. We avoid any fine-tuning and focus on zero-shot and few-shot settings using in-context learning, as this approach has lower hardware requirements and a lower barrier to entry, offers greater flexibility, and can even match or exceed fine-tuned performance on certain tasks (Su et al., 2022).
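To make the evaluated setup concrete, the sketch below shows what zero-shot belief state tracking with an LLM might look like: the dialogue history is placed into a prompt asking for slot values, and the model's free-text reply is parsed back into a structured belief state that could feed an external API. This is an illustrative sketch, not the paper's actual prompts; all function names, the slot format, and the example reply are hypothetical, and the LLM call itself is stubbed out.

```python
# Hypothetical sketch of zero-shot belief state tracking via an LLM.
# Not the paper's actual prompts; the `slot=value` output format and
# all names here are illustrative assumptions.

def build_prompt(domain, slots, history):
    """Assemble a zero-shot prompt asking for slot values as `slot=value` lines."""
    lines = [
        f"You are a {domain} booking assistant.",
        f"Extract values for these slots: {', '.join(slots)}.",
        "Answer with one `slot=value` line per filled slot; omit empty slots.",
        "",
        "Dialogue:",
    ]
    lines += [f"{speaker}: {utterance}" for speaker, utterance in history]
    lines.append("Belief state:")
    return "\n".join(lines)


def parse_belief_state(reply, slots):
    """Parse the model's reply into a slot -> value dict, ignoring noise lines."""
    state = {}
    for line in reply.splitlines():
        if "=" in line:
            slot, _, value = line.partition("=")
            slot, value = slot.strip(), value.strip()
            if slot in slots and value:
                state[slot] = value
    return state


if __name__ == "__main__":
    slots = ["area", "food", "pricerange"]
    history = [("user", "I want a cheap Italian place in the centre.")]
    prompt = build_prompt("restaurant", slots, history)
    # In a real evaluation the reply would come from an LLM API call on
    # `prompt`; here we use a hypothetical reply to show the parsing step.
    reply = "area=centre\nfood=Italian\npricerange=cheap"
    print(parse_belief_state(reply, slots))
```

The parsing step is where the paper's observed weakness matters: if the model emits wrong or malformed slot values, the downstream API call fails even when the response text reads fluently.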