Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games

Paper · arXiv 2310.01468 · Published October 2, 2023

Large language models (LLMs) are effective at answering questions that are clearly asked. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This underscores the need for intelligent agents capable of asking clarification questions to resolve ambiguities effectively. This capability requires complex understanding, state tracking, reasoning, and planning over multiple conversational turns. However, measuring it directly can be challenging. In this paper, we offer a surrogate problem that assesses an LLM's capability to deduce an entity unknown to itself, but revealed to a judge, by asking the judge a series of queries. This entity-deducing game can serve as an evaluation framework to probe the conversational reasoning and planning capabilities of language models. We systematically evaluate various LLMs and discover significant differences in their performance on this task. We find that strong LLMs like GPT-4 outperform human players by a large margin. We further employ Behavior Cloning (BC) to examine whether a weaker model can imitate a stronger model and generalize to unseen data or domains using only demonstrations from the stronger model.
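To make the Behavior Cloning setup concrete, the sketch below shows one plausible way to turn a stronger model's game transcripts into supervised (dialogue history, next question) pairs for fine-tuning a weaker model. The function and field names are hypothetical illustrations, not the paper's implementation.

```python
# Hedged sketch: convert a Q20 transcript from a stronger "teacher" model into
# (context -> next question) pairs, the standard supervised format for
# behavior cloning. Names here are illustrative, not from the paper.

def transcript_to_pairs(transcript):
    """Turn one game transcript into (dialogue-history, target-question) pairs."""
    pairs = []
    history = []
    for question, answer in transcript:
        context = "\n".join(history)
        pairs.append((context, question))  # the student imitates this question
        history.append(f"Q: {question}")
        history.append(f"A: {answer}")
    return pairs

demo = [("Is it an animal?", "Yes"), ("Can it fly?", "No")]
pairs = transcript_to_pairs(demo)
```

Each resulting pair can then feed any standard supervised fine-tuning pipeline; the key design choice is that the student only ever sees the teacher's questions, never its weights.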

In uncertain circumstances, intelligent conversational agents may need to take the initiative to reduce their uncertainty by asking good questions proactively, thereby solving problems more effectively. This requires intricate, interactive, strategic decision-making and reasoning about the agent’s next move in a multi-turn conversation. This capability is crucial in various applications, such as multistep task completion, task-oriented chatbots, recommendations, and conversational search.

Traditionally, dialogue systems, including the clarification process, have been built by modularizing such tasks into sub-tasks such as natural language understanding, state tracking, planning (policy learning), and response generation. However, recent advances in LLM-powered systems have made it possible to create an end-to-end pipeline, opening up new possibilities for developing autonomous agents that can complete complex tasks using enhanced planning and memory capabilities.

The agent has three objectives: 1) accurately assess the current dialogue state; 2) eliminate ambiguity in the user's intent and satisfy the user's demand by asking strategic questions; and 3) ask as few questions as possible.

In this study, we investigate a somewhat overlooked research problem: how good LLMs are at asking questions and deducing intent. We propose using entity-deducing games, specifically the 20 questions game (Q20) (Akinator, 2007), to assess the complex reasoning and strategic planning capability of LLMs in formulating precise questions and guesses over long conversations (Figure 1). This game requires a model to infer an unidentified entity through a sequence of questions that elicit simple responses of "Yes", "No", or "Maybe", using as few queries as possible. To achieve this, the model must track the dialogue state over turns and use its reasoning and planning skills to effectively partition and narrow down the search scope.
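The game protocol described above can be sketched as a simple loop between a guesser and a judge. The toy version below replaces both LLM roles with rule-based players over a tiny attribute table, purely to make the mechanics (asking, answering, and state tracking) explicit; the entities, attributes, and function names are hypothetical.

```python
# Toy sketch of the Q20 protocol: a guesser asks attribute questions, a judge
# who knows the hidden entity answers "Yes"/"No", and the guesser tracks state
# by pruning candidates inconsistent with the answers. Illustrative only.

ENTITIES = {
    "penguin": {"is_animal": True,  "can_fly": False, "lives_in_water": False},
    "salmon":  {"is_animal": True,  "can_fly": False, "lives_in_water": True},
    "eagle":   {"is_animal": True,  "can_fly": True,  "lives_in_water": False},
    "bicycle": {"is_animal": False, "can_fly": False, "lives_in_water": False},
}

def judge(hidden: str, attribute: str) -> str:
    """The judge knows the hidden entity and answers attribute questions."""
    return "Yes" if ENTITIES[hidden][attribute] else "No"

def play_q20(hidden: str, max_turns: int = 20):
    """Guesser loop: ask questions, shrink the candidate set, then guess."""
    candidates = set(ENTITIES)
    transcript = []
    for attribute in ["is_animal", "can_fly", "lives_in_water"]:
        if len(candidates) == 1 or len(transcript) >= max_turns:
            break
        answer = judge(hidden, attribute)
        transcript.append((attribute, answer))
        # State tracking: keep only candidates consistent with all answers so far.
        candidates = {e for e in candidates
                      if ENTITIES[e][attribute] == (answer == "Yes")}
    guess = sorted(candidates)[0]  # final guess from the remaining candidates
    return guess, transcript

guess, transcript = play_q20("penguin")
```

In the evaluation framework the paper describes, both `judge` and the fixed question list would be replaced by LLM calls; the pruning step is what the guesser model must perform implicitly in its dialogue state.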

A proficient guesser G requires several multi-turn dialogue capabilities working in synergy:

1. State Tracking and Understanding: G must comprehend multi-turn context, track asked questions, understand its position in the game, and resolve coreferences.

2. Strategic Planning: G needs to strategically ask questions to progress efficiently towards a better state, avoiding redundant queries and ensuring consistency with prior knowledge.

3. Inductive Reasoning: G must use conversation comprehension to generate conjectures based on acquired knowledge. G must inherently establish a taxonomy representation to efficiently and accurately identify the correct entity among numerous options.
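The strategic-planning capability above amounts to choosing questions that partition the remaining candidates as evenly as possible. A minimal information-theoretic sketch of that idea, with illustrative entity and attribute names (not the paper's method), is:

```python
import math

# Minimal sketch of "partition and narrow down": among candidate questions,
# prefer the one whose Yes/No split of the remaining entities is most balanced,
# i.e. has maximum entropy (expected information gain). Illustrative only.

def split_entropy(candidates, attribute, facts):
    """Entropy of the Yes/No partition induced by asking about `attribute`."""
    yes = sum(1 for e in candidates if facts[e][attribute])
    n = len(candidates)
    probs = [p for p in (yes / n, (n - yes) / n) if p > 0]
    return -sum(p * math.log2(p) for p in probs)

def best_question(candidates, attributes, facts):
    """Pick the attribute that most evenly bisects the candidate set."""
    return max(attributes, key=lambda a: split_entropy(candidates, a, facts))

FACTS = {
    "penguin": {"can_fly": False, "is_bird": True},
    "eagle":   {"can_fly": True,  "is_bird": True},
    "salmon":  {"can_fly": False, "is_bird": False},
    "shark":   {"can_fly": False, "is_bird": False},
}
q = best_question(set(FACTS), ["can_fly", "is_bird"], FACTS)
```

Here "is_bird" splits the four candidates 2/2 (entropy 1 bit) while "can_fly" splits them 1/3, so an even bisection is preferred. A strong guesser LLM must approximate this calculation implicitly over an open-ended entity space rather than a fixed fact table.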