Planning Like Human: A Dual-process Framework for Dialogue Planning

Paper · arXiv 2406.05374 · Published June 8, 2024
Conversation Architecture · Structure · Tasks Planning · Reinforcement Learning

In proactive dialogue, the challenge lies not just in generating responses but in steering conversations toward predetermined goals, a task where Large Language Models (LLMs) typically struggle due to their reactive nature. Traditional approaches to enhance dialogue planning in LLMs, ranging from elaborate prompt engineering to the integration of policy networks, either face efficiency issues or deliver suboptimal performance. Inspired by the dual-process theory in psychology, which identifies two distinct modes of thinking, intuitive (fast) and analytical (slow), we propose the Dual-Process Dialogue Planning (DPDP) framework. DPDP embodies this theory through two complementary planning systems: an instinctive policy model for familiar contexts and a deliberative Monte Carlo Tree Search (MCTS) mechanism for complex, novel scenarios. This dual strategy is further coupled with a novel two-stage training regimen: offline Reinforcement Learning (RL) for robust initial policy model formation, followed by MCTS-enhanced on-the-fly learning, which ensures a dynamic balance between efficiency and strategic depth. Our empirical evaluations across diverse dialogue tasks affirm DPDP's superiority in achieving both high-quality dialogues and operational efficiency, outpacing existing methods.

In response, we introduce the Dual-Process Dialogue Planning (DPDP) framework, a novel approach that incorporates two complementary planning systems: a neural policy language model (System 1) for quick, instinctive responses to familiar situations, and an MCTS-based planner (System 2) for analytical, rational, but slow planning in complex or novel scenarios. The framework switches dynamically between the two systems based on the policy LM's uncertainty, optimizing for both efficiency and depth of strategy. Key to the success of DPDP is the strength of the policy model, which we address through a two-stage training approach. First, we employ offline RL to build a robust initial policy, mitigating the impact of the suboptimal strategies and noise prevalent in training datasets. Second, we leverage MCTS simulations to guide the policy model toward superior strategies, accelerating its convergence and improving overall performance. Our comprehensive evaluation across various proactive dialogue tasks demonstrates DPDP's superiority over contemporary methods in both dialogue planning efficiency and efficacy. In summary, our contributions are threefold:
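The switching rule is the heart of the framework: when System 1 is confident, act immediately; when it is uncertain, escalate to System 2. Below is a minimal Python sketch of one plausible realization, using the entropy of the policy's action distribution as the uncertainty signal; the interfaces (`action_distribution`, `actions`, `search`) are hypothetical and not taken from the paper's code.

```python
import math
from typing import List

def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of the policy's action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def plan_next_action(policy_lm, mcts_planner, dialogue_state,
                     uncertainty_threshold: float = 1.0) -> str:
    """Pick the next dialogue strategy.

    System 1 (the policy LM) answers directly when it is confident;
    System 2 (MCTS) is invoked when the policy's uncertainty is high.
    """
    # One probability per candidate dialogue strategy (hypothetical API).
    probs = policy_lm.action_distribution(dialogue_state)
    if entropy(probs) < uncertainty_threshold:
        # Familiar situation: trust the fast, instinctive policy.
        best = max(range(len(probs)), key=probs.__getitem__)
        return policy_lm.actions[best]
    # Novel or complex situation: fall back to slow, deliberative search.
    return mcts_planner.search(dialogue_state, prior_policy=policy_lm)
```

The threshold trades latency for planning depth: a lower value routes more turns through MCTS, a higher value keeps more turns on the fast path.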

• We present a dual-system approach to dialogue planning that mirrors human cognitive processes, balancing efficiency and strategic depth.

• We develop a novel two-stage training method for the policy model, integrating offline RL and MCTS to significantly enhance its performance (see the training sketch after this list).

• Experimental results across two datasets validate that our proposed framework outperforms a range of strong baselines and runs more efficiently than MCTS-based methods.
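To make the two-stage regimen concrete, here is a hedged Python sketch of how the stages could fit together. The loss forms (advantage-weighted imitation for the offline stage, KL-matching to an MCTS-improved distribution for the online stage) and all helper objects (`policy`, `offline_dataset`, `env`, `run_mcts`) are illustrative assumptions, not the paper's exact objectives.

```python
import math
import torch
import torch.nn.functional as F

def stage1_offline_rl(policy, offline_dataset, optimizer, beta: float = 1.0):
    """Stage 1: offline RL on logged dialogues.

    Advantage-weighted imitation: strategies with higher estimated return
    are up-weighted, dampening the noisy or suboptimal strategies that
    human-collected corpora contain.
    """
    for state, action, advantage in offline_dataset:  # advantage: float
        weight = min(math.exp(beta * advantage), 20.0)  # clip for stability
        loss = -weight * policy.log_prob(state, action)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def stage2_mcts_guided(policy, env, run_mcts, optimizer, episodes: int):
    """Stage 2: on-the-fly learning guided by MCTS.

    The improved action distribution returned by MCTS serves as a target
    that the policy is trained to match (an AlphaZero-style
    policy-improvement step).
    """
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            target = run_mcts(state, prior=policy)      # improved probs (tensor)
            log_probs = policy.log_distribution(state)  # log-probs (tensor)
            loss = F.kl_div(log_probs, target, reduction="sum")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            state, done = env.step(int(target.argmax()))
```

Under this reading, Stage 1 gives the fast System 1 a sound starting point from fixed data, and Stage 2 lets the slow System 2 keep improving it during interaction, which is consistent with the efficiency/depth balance claimed above.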