Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents
Proactive dialogues are a practical yet challenging dialogue problem in the era of large language models (LLMs), where dialogue policy planning is the key to improving the proactivity of LLMs. Most existing studies enable the dialogue policy planning of LLMs with various prompting schemes or iteratively enhance this capability on a given case with verbal AI feedback. However, these approaches are either bounded by the policy planning capability of the frozen LLMs or hard to transfer to new cases. In this work, we introduce a new dialogue policy planning paradigm that strategizes LLMs for proactive dialogue problems with a tunable language model plug-in acting as a plug-and-play dialogue policy planner, named PPDPP. Specifically, we develop a novel training framework that combines supervised fine-tuning on available human-annotated data with reinforcement learning from goal-oriented AI feedback on dynamic interaction data collected through LLM-based self-play simulation. In this manner, the LLM-powered dialogue agent not only generalizes to different cases after training, but is also applicable to different applications simply by substituting the learned plug-in. In addition, we propose to evaluate the policy planning capability of dialogue systems in an interactive setting. Experimental results demonstrate that PPDPP consistently and substantially outperforms existing approaches on three proactive dialogue applications: negotiation, emotional support, and tutoring dialogues.
However, as LLMs are trained to passively follow users’ instructions, dialogue agents built upon them typically prioritize accommodating users’ intentions. Consequently, LLM-powered dialogue agents often struggle with proactive dialogue problems, which require the agent to strategically take the initiative and steer the conversation towards an anticipated goal (Deng et al., 2023a), such as negotiation (Zhan et al., 2022), emotional support (Liu et al., 2021; Zheng et al., 2023), and tutoring (Macina et al., 2023).
Despite their effectiveness in improving dialogue policy planning, several challenges remain. 1) LLMs fall short of planning effective dialogue policies with zero-shot or few-shot prompting schemes (Deng et al., 2023b); the improvement in goal accomplishment is therefore limited by the planning capability of the frozen actor LLM. 2) Existing approaches based on iterative refinement (Fu et al., 2023; Yu et al., 2023) lack transferability, as multiple rounds of self-play dialogue simulation are required for every new case to plan a satisfactory strategy for it, which is impractical in real-world applications. 3) Existing studies typically evaluate dialogue agents in terms of turn-level response quality against fixed reference responses. These evaluation protocols fail to automatically assess the policy planning capability of the dialogue agent, which is determined by the effectiveness and efficiency of goal achievement in multi-turn conversations.
Dialogue Policy Planning
Dialogue policy planning has been widely studied in task-oriented dialogues (Jang et al., 2022; Feng et al., 2023) and conversational recommendation (Gao et al., 2021; Deng et al., 2021), where the interaction process can be easily abstracted into a sequence of slots and values (e.g., location, price, etc.). Meanwhile, the success of planning is objective, such as whether the system provides an appropriate entity/item. However, in proactive dialogues (Deng et al., 2023a; Liao et al., 2023), there is no pre-defined agenda or schema to simplify the multi-turn interaction. Instead, the natural language interaction requires more complex reasoning and certain domain knowledge (e.g., psychological or pedagogical skills). Moreover, the planning outcome is rather subjective, such as learning gain during tutoring or emotional intensity relaxation during counselling. This makes planning an optimal dialogue policy in proactive dialogues considerably harder. To mimic the behaviors of human experts, corpus-based fine-tuning approaches are typically adopted to predict dialogue strategies (Joshi et al., 2021; Cheng et al., 2022; Wang et al., 2023c). As summarized in Table 1, we differentiate our method from recent LLM-based policy planning methods along seven perspectives. General policy planning methods typically optimize towards an objective goal in a single-turn interaction, such as ROUGE score in summarization (Li et al., 2023) or accuracy in QA (Shinn et al., 2023; Yao et al., 2023). As for dialogue policy planning methods, Chen et al. (2023) validate the effectiveness of mixed-initiative strategy-based prompting in proactive dialogue problems. Some methods (Wang et al., 2023b; Deng et al., 2023b; Zhang et al., 2023a) prompt LLMs to conduct self-thinking about the policy for the next turn, ignoring long-term conversation goals. Fu et al. (2023) conduct self-play simulation to iteratively refine the policy planning with long-term feedback. However, this type of iterative refinement is specific to each individual case and not transferable to new situations. Moreover, these methods cannot improve the policy planning capability of LLM-powered dialogue agents, as all parameters are frozen and not learnable. In contrast, the proposed PPDPP fine-tunes a learnable language model plug-in to improve the policy planning capability without affecting other functionalities of LLM-powered dialogue agents.
To tackle these challenges, we introduce a novel dialogue policy planning paradigm to strategize LLMs with a tunable language model plug-in, named Plug-and-Play Dialogue Policy Planner (PPDPP). As shown in Figure 1(b), PPDPP acts as the policy agent that predicts the dialogue strategy for the next turn of the dialogue agent, and is first fine-tuned in a supervised manner on available human-annotated corpora. Then, we employ the self-play paradigm to prompt two LLMs (an assistant and a user) with case background information to perform role-playing conversations that simulate the dynamic environment of multi-turn interactions between the dialogue agent and a real user. For each case, these two LLMs are assigned distinct, often competing goals (e.g., in negotiation dialogues, the buyer seeks a more favorable price, whereas the seller endeavors to secure a higher one). Meanwhile, a third LLM acts as the reward model, providing goal-oriented verbal feedback that indicates goal achievement and is transformed into scalar rewards for reinforcement learning (RL). When the goal or the maximum number of conversation turns is reached, we leverage an RL algorithm to further tune the policy agent with the collected interaction data and the goal-oriented AI feedback. In this way, the LLM-powered dialogue agent not only exhibits more adaptability to new cases than prompt-based approaches, but also finds utility across diverse applications simply by substituting the tuned plug-in, without affecting the LLM’s exceptional capabilities of context understanding and response generation.
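To make this pipeline concrete, below is a minimal sketch of one self-play episode in Python. All names here (call_llm, policy_plugin, the strategy set, and the verbal-feedback mapping) are illustrative assumptions rather than the paper's exact implementation; the plug-in is stubbed with a random choice, whereas in PPDPP it is a small fine-tuned language model that scores the candidate strategies.

```python
# Minimal sketch of a PPDPP-style self-play episode (illustrative names, not the
# authors' exact prompts, action set, or reward design).
import random
from typing import List

# Assumed strategy set; the real action space is task-specific (e.g., negotiation acts).
STRATEGIES = ["ask_price", "counter_offer", "provide_rationale", "accept", "reject"]

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (assistant, user simulator, or reward model)."""
    return "..."

def policy_plugin(dialogue_history: List[str]) -> int:
    """Tunable plug-in that picks the next dialogue strategy.
    Stubbed here; in PPDPP it is first trained with SFT, then refined with RL."""
    return random.randrange(len(STRATEGIES))

def verbal_reward_to_scalar(feedback: str) -> float:
    """Map the reward LLM's verbal judgement to a scalar (illustrative mapping)."""
    mapping = {"goal achieved": 1.0, "partial progress": 0.5, "no progress": -0.1}
    return mapping.get(feedback.strip().lower(), 0.0)

def self_play_episode(case: dict, max_turns: int = 8):
    history, trajectory = [], []
    for _ in range(max_turns):
        action = policy_plugin(history)                      # 1. plug-in predicts the next strategy
        sys_utt = call_llm(f"Act as the assistant. Strategy: {STRATEGIES[action]}. "
                           f"Case: {case}. History: {history}")
        history.append(f"Assistant: {sys_utt}")
        usr_utt = call_llm(f"Act as the user with goal {case.get('user_goal')}. "
                           f"History: {history}")            # 2. user LLM simulates the real user
        history.append(f"User: {usr_utt}")
        feedback = call_llm(f"Judge goal achievement for: {history}")  # 3. reward LLM gives verbal feedback
        reward = verbal_reward_to_scalar(feedback)
        trajectory.append((history[:], action, reward))
        if reward >= 1.0:                                    # stop once the goal is reached
            break
    return trajectory
```

Each episode yields a trajectory of (state, action, reward) tuples; the RL stage then consumes these trajectories to update the plug-in while the three LLMs remain frozen.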
Learnable Plug-ins for Large Language Models
Due to the black-box nature of commercial LLMs and the high expense of fine-tuning entire open-source LLMs, a recent trend in improving certain capabilities of LLMs is to investigate the utility of external plug-ins, such as APIs (Schick et al., 2023), vision models (Wu et al., 2023), or functional models from Huggingface (Shen et al., 2023). However, these plug-ins fail to learn from valuable feedback to iteratively enhance their capabilities, so performance depends solely on the quality of the fixed plug-ins.
Most existing studies (Shinn et al., 2023; Fu et al., 2023; Madaan et al., 2023; Hao et al., 2023) directly leverage the natural language feedback generated by LLMs to self-refine the task instruction prompt, rather than obtaining a scalar reward for training the model. In this work, we propose goal-oriented AI feedback to facilitate RLAIF in the context of dialogue systems, which not only transforms textual feedback into scalar rewards, but also captures long-term goal-oriented rewards obtained from dynamic multi-turn interactions, instead of AI preferences on single-turn responses.
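As an illustration of how such verbal feedback can drive learning, the sketch below maps assumed verbal verdicts from the reward LLM to scalars and applies a REINFORCE-style policy-gradient update to the plug-in. The verdict strings, discount factor, and choice of REINFORCE are assumptions made for exposition, not necessarily the paper's exact reward design or RL algorithm.

```python
# Illustrative sketch: goal-oriented verbal AI feedback -> scalar rewards -> policy-gradient update.
import torch
import torch.nn.functional as F

# Assumed mapping from the reward LLM's verbal verdicts to scalar rewards.
VERBAL_TO_SCALAR = {"deal reached": 1.0, "likely to reach a deal": 0.3,
                    "no deal yet": -0.1, "conversation broke down": -1.0}

def reinforce_update(policy, optimizer, episode, gamma: float = 0.95):
    """episode: list of (encoded_state, action_id, verbal_feedback) from self-play.
    `policy` is assumed to be an nn.Module mapping an encoded dialogue state to
    logits over the dialogue strategies (the tunable plug-in)."""
    # Compute discounted returns from the verbal rewards, so early strategies
    # receive credit for eventual goal achievement.
    returns, G = [], 0.0
    for _, _, feedback in reversed(episode):
        G = VERBAL_TO_SCALAR.get(feedback, 0.0) + gamma * G
        returns.insert(0, G)

    loss = 0.0
    for (state, action, _), G in zip(episode, returns):
        logits = policy(state)                               # scores for each dialogue strategy
        log_prob = F.log_softmax(logits, dim=-1)[action]
        loss = loss - log_prob * G                           # REINFORCE objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The discounted return is what distinguishes this setup from single-turn preference feedback: the scalar signal reflects how the whole multi-turn interaction ends, rather than the quality of any individual response.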