Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager

Paper · arXiv 2506.19652 · Published June 24, 2025
Tags: Conversation Architecture · Structure · Reinforcement Learning · Psychology · Users · Synthetic Dialog

In this work, we propose a novel framework that integrates large language models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a specific goal. By leveraging hierarchical reinforcement learning to model the structured phases of dialogue and employing meta-learning to adapt across diverse user profiles, our approach improves adaptability and efficiency, enabling the system to learn from limited data, transition fluidly between dialogue phases, and personalize responses to heterogeneous patient needs. We apply our framework to Motivational Interviewing, aiming to foster behavior change, and demonstrate that the proposed dialogue manager outperforms a state-of-the-art LLM baseline in terms of reward, showing a potential benefit of conditioning LLMs to create open-ended dialogue systems with specific goals.

Although these models exhibit remarkable language generation capabilities, they also present significant limitations, many of which can be addressed through insights from "traditional" dialogue research. In particular, LLMs often lack the controllability and structured decision-making of conventional rule-based systems, which are more predictable and interpretable (Shidara et al., 2020). Rule-based domain-specific dialogue systems (Hadi et al., 2024) offer notable advantages, including improved controllability, explainability, and the ability to integrate expert knowledge.

In this work, we investigate a hybrid approach in which an RL-based dialogue manager governs an LLM to simulate Motivational Interviewing (MI) dialogues, aiming to balance adaptability and control for more effective virtual therapy support.

Complex dialogues, such as those in MI, evolve through distinct phases, each guided by unique long-term strategies (Miller and Rollnick, 2012). The dialogue usually begins with an engaging phase, where rapport is established and patient engagement with the therapist is fostered. This is followed by a focusing phase, where core issues, their underlying causes, and the patient’s background are identified to set a clear focus for the conversation. The third phase is the evoking phase, which involves encouraging the patient’s motivation for change by eliciting and amplifying “change talk”. Finally, planning involves developing a specific, actionable plan for behavior change based on the patient’s motivation and goals. Therapists must ensure that specific objectives, such as achieving high levels of engagement, clarifying core issues, and cultivating sufficient motivation, are met before transitioning between phases.
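This phase structure reads naturally as a small state machine over ordered stages with per-phase completion tests. Below is a minimal Python sketch of that reading; the MIPhase names follow the text, while ready_to_advance, its inputs, and the 0.7 thresholds are illustrative assumptions rather than the paper's actual transition criteria.

```python
from enum import Enum

class MIPhase(Enum):
    """The four MI phases, in their canonical order."""
    ENGAGING = 0
    FOCUSING = 1
    EVOKING = 2
    PLANNING = 3

def ready_to_advance(phase: MIPhase, engagement: float,
                     focus_clarity: float, motivation: float) -> bool:
    """Illustrative transition test: each phase has an objective that must
    be met before the therapist moves on (thresholds are assumptions)."""
    if phase is MIPhase.ENGAGING:
        return engagement >= 0.7      # rapport and engagement established
    if phase is MIPhase.FOCUSING:
        return focus_clarity >= 0.7   # core issues and background identified
    if phase is MIPhase.EVOKING:
        return motivation >= 0.7      # sufficient change talk elicited
    return False                      # PLANNING is the final phase
```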

As proposed by anonymous (2024a), patients in MI can be classified into three categories: Open-to-Change, Resistant-to-Change, and Receptive. Open-to-Change individuals demonstrate a strong willingness to modify unhealthy behaviors.

Steenstra et al. (2024) identified the limitations of rule-based approaches in maintaining adherence to therapeutic protocols and proposed leveraging LLMs for this application, demonstrating promising results.

We propose a novel framework that integrates LLMs with an RL-based dialogue manager to structure MI dialogues across different phases while dynamically adapting to diverse patient profiles. By synergizing structured control with generative flexibility, our approach enhances adaptability and efficiency, enabling the system to learn from limited data, transition fluidly between MI phases, and personalize responses to heterogeneous patient needs.

This hierarchical structure ensures dynamic and context-sensitive dialogue management, allowing real-time adjustments to both the patient’s evolving needs and the interaction context. It balances global objectives, such as increasing motivation for behavior change, with more localized goals, such as answering patient inquiries.
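As a rough illustration of this two-level control, the sketch below lets a master policy pick a high-level option for the turn (roughly, an MI-phase strategy) and the selected sub-policy pick the concrete dialogue act. UniformPolicy, HierarchicalDialogueManager, and the toy action names are placeholders of our own, not the paper's implementation.

```python
import random

class UniformPolicy:
    """Placeholder policy that picks uniformly among its actions."""
    def __init__(self, actions):
        self.actions = actions

    def select(self, state):
        return random.choice(self.actions)

class HierarchicalDialogueManager:
    """Master policy chooses an option (global objective); the chosen
    sub-policy then chooses the dialogue act (local goal) for the turn."""
    def __init__(self, master, subs):
        self.master = master  # maps master state -> option id
        self.subs = subs      # one sub-policy per option

    def act(self, master_state, turn_state):
        option = self.master.select(master_state)
        dialogue_act = self.subs[option].select(turn_state)
        return option, dialogue_act

# Toy usage: three options, each with its own dialogue-act repertoire.
manager = HierarchicalDialogueManager(
    UniformPolicy([0, 1, 2]),
    {i: UniformPolicy(["reflect", "ask", "plan"]) for i in range(3)})
print(manager.act(master_state=None, turn_state=None))
```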

4.4.2 Training Framework

The training framework leverages an off-policy reinforcement learning (RL) approach, which enables efficient reuse of dialogue turns across multiple iterations as policies evolve. Specifically, we utilize the Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018), which enhances the system’s adaptability to new human users in online interactions. This approach allows for policy updates at each turn while retaining information from previous turns. Each training epoch targets a specific user profile and begins by cloning the master policy θ. The optimization process occurs in two phases. In the first phase, the master policy θ is held fixed and the sub-policies ψ_0, …, ψ_n are optimized using SAC. In the second phase, the sub-policies remain fixed while the cloned master policy θ_clone is optimized using SAC. After these optimizations, the updated policies are evaluated. Finally, the master policy θ is updated using the MAML algorithm. The complete training process is detailed in Algorithm 2.
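A compressed view of one training epoch might look like the sketch below. The sac_update, evaluate, and maml_update callables stand in for the actual SAC optimizer, rollout evaluation, and MAML outer step; their signatures are assumptions, and Algorithm 2 in the paper remains the authoritative description.

```python
import copy

def meta_train(master, sub_policies, user_profiles, epochs,
               sac_update, evaluate, maml_update):
    """Sketch of the two-phase schedule described above (not the
    paper's code): clone the master, optimize sub-policies, optimize
    the clone, evaluate, then take the MAML outer step on the master."""
    for epoch in range(epochs):
        profile = user_profiles[epoch % len(user_profiles)]  # one profile per epoch
        master_clone = copy.deepcopy(master)                 # clone theta

        # Phase 1: master fixed, optimize each sub-policy psi_i with SAC.
        for psi in sub_policies:
            sac_update(psi, frozen=master_clone, profile=profile)

        # Phase 2: sub-policies fixed, optimize the cloned master with SAC.
        sac_update(master_clone, frozen=sub_policies, profile=profile)

        # Evaluate the adapted policies, then update the original master
        # with the MAML outer step using the clone's adaptation.
        score = evaluate(master_clone, sub_policies, profile)
        maml_update(master, master_clone, score)
```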

5.2.1 Simulating Patients in MI

Simulating patients in MI has been explored in prior research. For instance, anonymous (2024c) proposed a prompt to simulate a user with a large language model (LLM) (anonymous, 2024b). This approach has been validated to produce contextually relevant, natural dialogue acts and utterances (anonymous, 2024b,c), although differences between user profiles have not been tested, which may be a limitation of this simulator. We use this user simulator to train and evaluate our dialogue manager. In the remainder of this article, the term user refers to this user simulator.

5.2.2 Action Space

The agent operates in a discrete action space consisting of 13 possible dialogue acts, categorized into task-oriented and socially oriented dialogue acts. Task-oriented dialogue acts include Asking for Consent or Validation, Providing Medical Education and Guidance, Planning with the Patient, Giving a Solution, Asking about Current Emotions, Inviting a Shift in Outlook, Asking for Information, and Reflection. Socially oriented dialogue acts include Empathic Reactions, Acknowledging Progress and Encouraging, Backchanneling, Greeting or Closing, and Normalizing Experiences while Providing Reassurance. This taxonomy was introduced by anonymous (2024a).
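For concreteness, the 13-act space could be encoded as below; the identifier names and integer indices are our own shorthand for the taxonomy, not identifiers from the paper.

```python
from enum import IntEnum

class AgentAct(IntEnum):
    # Task-oriented dialogue acts
    ASK_CONSENT_OR_VALIDATION = 0
    PROVIDE_MEDICAL_EDUCATION_AND_GUIDANCE = 1
    PLAN_WITH_PATIENT = 2
    GIVE_SOLUTION = 3
    ASK_ABOUT_CURRENT_EMOTIONS = 4
    INVITE_SHIFT_IN_OUTLOOK = 5
    ASK_FOR_INFORMATION = 6
    REFLECTION = 7
    # Socially oriented dialogue acts
    EMPATHIC_REACTION = 8
    ACKNOWLEDGE_PROGRESS_AND_ENCOURAGE = 9
    BACKCHANNEL = 10
    GREETING_OR_CLOSING = 11
    NORMALIZE_AND_REASSURE = 12

assert len(AgentAct) == 13  # 8 task-oriented + 5 socially oriented
```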

5.2.3 State Space

The agent’s state space includes the most recent agent and user dialogue acts. The user can employ 9 different dialogue acts: Changing Unhealthy Behavior, Sustaining Unhealthy Behavior, Sharing Negative/Positive Feelings or Emotions, Sharing Personal Information, Realization or Understanding, Greeting or Closing, Backchanneling, and Asking for Medical Information. Additionally, the state space incorporates the current timestamp and an encoded representation of the dialogue context, which comprises the last three utterances.
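One plausible vectorization of this state is sketched below: one-hot encodings of the two most recent dialogue acts, the turn index as the timestamp, and the context embedding. The layout and build_state are assumptions of ours; the paper does not specify the exact encoding.

```python
import numpy as np

NUM_AGENT_ACTS, NUM_USER_ACTS = 13, 9

def build_state(last_agent_act: int, last_user_act: int, turn_index: int,
                context_embedding: np.ndarray) -> np.ndarray:
    """Assumed state layout: one-hot agent act, one-hot user act,
    timestamp, then an embedding of the last three utterances."""
    agent = np.eye(NUM_AGENT_ACTS)[last_agent_act]
    user = np.eye(NUM_USER_ACTS)[last_user_act]
    return np.concatenate([agent, user, [float(turn_index)], context_embedding])
```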

5.2.4 Master State Space

The master policy’s state space is composed of three approximations: context knowledge, engagement, and evocation. Context knowledge is approximated by the number of times the user employs the Sharing Personal Information dialogue act; engagement, by the number of times the user employs the Sharing Positive/Negative Feelings dialogue act; and evocation, by the number of uses of the Understanding or New Perspective dialogue act.
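Since all three quantities are running counts of user dialogue acts, the master state is straightforward to compute from the dialogue history, as in this illustrative sketch (the act-name strings follow the taxonomy above; the function itself is an assumption):

```python
from collections import Counter

def master_state(user_act_history: list[str]) -> tuple[int, int, int]:
    """Master state as three counters over the user's past dialogue acts."""
    counts = Counter(user_act_history)
    context_knowledge = counts["Sharing Personal Information"]
    engagement = counts["Sharing Positive/Negative Feelings"]
    evocation = counts["Understanding or New Perspective"]
    return context_knowledge, engagement, evocation
```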

5.2.5 Reward Function

The reward function is designed to predict therapy outcomes by assigning specific values to different user dialogue acts. Prior research underscores the critical role of user responses, such as sustain talk, which is linked to poorer treatment outcomes (Magill et al., 2014), and change talk, which is associated with reduced risk behaviors during follow-up assessments (Magill et al., 2018). Additionally, the reward function incentivizes structured progression through the MI phases. A reward of +5 is assigned for Changing Unhealthy Behavior, as this represents the desired outcome, whereas a penalty of −5 is given for Sustaining Unhealthy Behavior, which should be discouraged. In the Engaging phase, a reward of +50 is granted for expressing feelings. Once the user has expressed at least two emotions, a reward of +100 is assigned for providing information in the Focusing phase. After at least two pieces of information have been shared, a reward of +150 is given for evoking-related dialogue acts, culminating in a reward of +200 for planning-related dialogue acts.
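Putting the stated values together, a minimal sketch of the reward computation could look as follows. The count-based gating mirrors the text; PLANNING_ACTS and the exact act strings are placeholder assumptions.

```python
PLANNING_ACTS = {"Planning"}  # placeholder for the paper's planning-related acts

def reward(user_act: str, emotions_shared: int, info_shared: int) -> float:
    """Phase-staged reward sketched from the values stated in the text."""
    r = 0.0
    if user_act == "Changing Unhealthy Behavior":
        r += 5.0    # desired outcome
    elif user_act == "Sustaining Unhealthy Behavior":
        r -= 5.0    # discouraged outcome
    if user_act == "Sharing Positive/Negative Feelings":
        r += 50.0   # Engaging-phase objective
    if user_act == "Sharing Personal Information" and emotions_shared >= 2:
        r += 100.0  # Focusing, unlocked after two expressed emotions
    if user_act == "Understanding or New Perspective" and info_shared >= 2:
        r += 150.0  # Evoking, unlocked after two pieces of information
    if user_act in PLANNING_ACTS:
        r += 200.0  # culminating Planning-phase reward
    return r
```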

Dialogue acts associated with the Engaging phase, such as asking about emotions or sharing emotions, should be more prevalent at the beginning of the conversation, whereas those related to the Planning phase, such as providing solutions or promoting behavior change, should appear more frequently towards the end (Miller and Rollnick, 2012). While Engaging should occur throughout the entire dialogue, the later stages should be more focused on Planning.

For the model trained without meta-learning, we observe a collapse of the master policy to a single dominant action across interactions. This highlights the difficulty of learning a generalized policy that performs well across diverse user profiles without explicit mechanisms for adaptation. In contrast, the use of meta-learning allows the master policy to maintain variability and adaptability in its actions.

The models effectively differentiate between distinct phases, initially operating in the Engaging/Focusing phases (Master action 5) before transitioning into the Focusing/Planning phases (Master action 3) and the Evoking phase (Master action 4).

We leverage HRL to model the structured phases of MI and employ meta-learning to enhance adaptability across diverse user profiles. Our findings demonstrate that the proposed dialogue manager outperforms an LLM baseline in terms of reward. Additionally, our analysis of the generated conversations provides valuable insights into how HRL and meta-learning contribute to the structured yet adaptive nature of the dialogue.