Enhancing personalized multi-turn dialogue with curiosity reward

Paper · arXiv 2504.03206 · Published April 4, 2025
Assistants · Personalization · Conversation Agents

Current training methods like Reinforcement Learning from Human Feedback (RLHF) prioritize helpfulness and safety but fall short in fostering truly empathetic, adaptive, and personalized interactions. Traditional approaches to personalization often rely on extensive user history, limiting their effectiveness for new or context-limited users. To overcome these limitations, we propose incorporating an intrinsic motivation to improve the conversational agent's model of the user as an additional reward alongside multi-turn RLHF. This reward mechanism encourages the agent to actively elicit user traits by optimizing conversations to increase the accuracy of its user model. Consequently, the policy agent can deliver more personalized interactions by obtaining more information about the user. We apply our method in both education and fitness settings, where LLMs teach concepts or recommend personalized strategies based on users' hidden learning styles or lifestyle attributes.

Despite this importance, most existing approaches to personalizing LLMs require substantial pre-collected user data or profiles. Recent works on aligning models to user-specific preferences often assume access to a user profile, history, or latent representation gathered prior to the conversation (Poddar et al., 2024; Wu et al., 2024; Chen et al., 2025; Sun et al., 2025; Shenfeld et al., 2025). For example, reward-modeling techniques have been proposed to infer latent user clusters or employ user-specific fine-tuning, but these typically involve additional training on feedback data from each user ahead of time.

In this paper, we develop a novel method for enhancing LLMs' ability to conduct personalized multi-turn conversations. We posit that a good conversational agent should treat the interaction itself as an opportunity to learn about the user. As the dialogue progresses, the LLM should actively gather information about the user's preferences, personality, or other relevant attributes, and adapt its responses accordingly. To achieve this, we draw inspiration from intrinsic motivation in reinforcement learning (Houthooft et al., 2016). In particular, we introduce an intrinsic reward signal that encourages the LLM to ask insightful questions and give context-sensitive responses aimed at uncovering the user's characteristics. Intuitively, the agent is rewarded for reducing its uncertainty about the user. This mechanism drives the model to balance helpfulness with inquisitiveness: rather than only responding passively, it will occasionally probe or adjust its style to better personalize the conversation. Figure 1 illustrates this concept: to realize personalized conversations, we propose to reward an LLM according to the improvement in its belief over the user type after it takes an action. This turn-based reward, added on top of the original sparse end-of-conversation reward, incentivizes the model to prioritize learning about the user.
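As an illustration of this turn-level reward, the sketch below computes the intrinsic bonus as the change in the user model's belief in the true user type after the agent's latest utterance. The candidate user types and the use of raw probability differences (rather than, say, log-probabilities or prediction accuracy) are assumptions made for the example, not the paper's exact formulation.

```python
from typing import Dict

def curiosity_reward(
    belief_before: Dict[str, float],
    belief_after: Dict[str, float],
    true_type: str,
) -> float:
    """Turn-level intrinsic reward: improvement in the agent's belief
    in the user's true type after its latest utterance.

    Beliefs are probability distributions over candidate user types.
    """
    return belief_after[true_type] - belief_before[true_type]

# Illustrative usage with made-up learning-style types: the belief in the
# true type rises from 0.40 to 0.65 after a probing question, so the
# agent receives an intrinsic reward of +0.25 for that turn.
before = {"visual": 0.40, "auditory": 0.35, "kinesthetic": 0.25}
after = {"visual": 0.65, "auditory": 0.20, "kinesthetic": 0.15}
print(curiosity_reward(before, after, true_type="visual"))  # 0.25
```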

We augment the standard RLHF training with an auxiliary objective: the model is tasked not only with maximizing the human feedback reward, but also with predicting the student's latent profile (e.g., learning style or knowledge level) from the dialogue. This auxiliary reward provides an intrinsic drive for the model to personalize its strategy to each user. The result is a policy that learns to adapt its teaching style (for example, being more patient, more detailed, or more encouraging) based on the inferred student type. In the Exercise Recommendation setting, a fitness agent (the LLM) recommends personalized exercise strategies to users (simulated by another LLM) during conversations. These recommendations are tailored to the user's lifestyle attributes, such as age, personality, and injury status, which influence the optimal exercise strategy.
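One plausible way to combine the two signals during training is sketched below, under an assumed weighting hyperparameter `beta` and assumed function names (the paper's exact combination scheme may differ): scale the per-turn curiosity bonuses and add the sparse human-feedback reward on the final turn before computing returns for the policy update.

```python
from typing import List

def blended_rewards(
    curiosity_rewards: List[float],  # per-turn belief improvements (intrinsic signal)
    final_task_reward: float,        # sparse RLHF/task reward at the end of the conversation
    beta: float = 0.5,               # assumed weight on the intrinsic signal
) -> List[float]:
    """Blend the sparse end-of-conversation reward with the dense per-turn
    curiosity bonus; the result feeds into a standard RL update (e.g. PPO)."""
    rewards = [beta * r for r in curiosity_rewards]
    rewards[-1] += final_task_reward  # the task reward arrives only on the last turn
    return rewards

# Example: a 4-turn conversation with small belief gains each turn and a
# task reward of 1.0 granted once the conversation ends.
print(blended_rewards([0.10, 0.25, 0.05, 0.00], final_task_reward=1.0))
# [0.05, 0.125, 0.025, 1.0]
```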

For example, Poddar et al. (2024) introduce variational preference learning (VPL) to infer a latent context vector for each user, enabling the reward model (and policy) to adjust to that user's revealed preferences. Similarly, Chen et al. (2025) develop a pluralistic alignment framework (PAL) that learns a latent preference space covering heterogeneous user opinions; their method trains a reward function that can generalize to new users from a few examples by modeling each user as a point in this latent space. Another approach is to factorize the reward function into a combination of shared components: Shenfeld et al. (2025) present Personalization via Reward Factorization (PReF), which represents an individual's reward as a weighted sum of base reward functions and uses a small number of preference queries (e.g., about 10) to infer the user-specific weights. Wu et al. (2024) develop Reinforcement Learning from Prediction Feedback (RLPF), which derives reward signals from downstream personalization tasks to generate natural-language user profiles, which are then used to personalize LLMs. These personalized alignment methods indeed tailor an LLM's behavior to different users, but they require additional user-specific information or preparatory work before the personalized interaction can take place.
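To make the reward-factorization idea concrete, the sketch below models a user's reward as a weighted sum of shared base reward functions and fits the user-specific weights from a handful of pairwise preference queries. The least-squares fit and all names here are illustrative assumptions, not PReF's actual estimator.

```python
import numpy as np

def fit_user_weights(
    base_rewards_a: np.ndarray,  # (n_queries, k) base reward values of option A in each query
    base_rewards_b: np.ndarray,  # (n_queries, k) base reward values of option B in each query
    prefers_a: np.ndarray,       # (n_queries,) 1 if the user chose A, else 0
) -> np.ndarray:
    """Fit user-specific weights w so that w @ (r_A - r_B) matches the
    observed preferences (a simple least-squares surrogate for illustration)."""
    diff = base_rewards_a - base_rewards_b
    targets = 2 * prefers_a - 1  # map {0, 1} -> {-1, +1}
    w, *_ = np.linalg.lstsq(diff, targets, rcond=None)
    return w

def user_reward(w: np.ndarray, base_rewards: np.ndarray) -> float:
    """Factorized reward: weighted sum of the shared base reward values."""
    return float(w @ base_rewards)
```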

In contrast, our method does not require any separate calibration or auxiliary user profile in advance. The personalization of the agent emerges dynamically through multi-turn interactions: as the conversation unfolds, the model infers the user's traits and preferences and adapts its responses accordingly. This on-the-fly learning of user preferences means our approach can personalize in real time without an upfront personalization phase, which is a key differentiator from prior RLHF-based personalization techniques. The ability to actively infer user preferences during the conversation also brings benefits in open-ended dialogues. In the absence of a clearly defined task, the enjoyability of the interaction itself becomes an important consideration. Encouraging users to voluntarily share personal ideas can enhance their engagement and overall enjoyment of the conversation, which is not achievable with traditional approaches that focus primarily on helpfulness and harmlessness.

The dataset construction involved three key steps:

  1. User Attribute Definition and Sampling: We defined 20 user attributes encompassing a range of personal characteristics, including age, socioeconomic status, personality traits, occupation, physical limitations, and hobbies. For each simulated user, we randomly sampled values for each of these attributes, creating a diverse user population.

  2. Ideal Strategy Derivation: To simulate a high-quality, attribute-driven exercise strategy classifier, we established a deterministic logic rule that maps user attributes to an ideal exercise strategy. For example, we may recommend a team sport for those who are outdoorsy and extroverted. Among the 20 defined attributes, 5 were designated as relevant factors influencing the recommendation, while the remaining 15 served as background characteristics, emulating the complexity of real-world scenarios. (A toy version of the sampling and the rule is sketched after this list.)

  3. User Backstory Generation: To provide contextual richness and ensure consistent agent behavior, we utilized the Gemini 1.5 Pro model (Team et al., 2024a) to generate a detailed backstory for each user based on their attribute values. These backstories were then used in prompts for the environment model, ensuring that the environment model remained consistent with the user’s defined characteristics throughout the conversation.
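A minimal sketch of steps 1 and 2 appears below. The attribute names, value sets, and decision rule are invented for illustration; the paper's actual 20 attributes and its exact logic rule are not reproduced here.

```python
import random

# Illustrative subset of the user attributes (names and values are examples,
# not the paper's exact schema). A few attributes drive the recommendation;
# the rest act as background characteristics.
ATTRIBUTES = {
    "age_group":        ["18-29", "30-49", "50+"],
    "personality":      ["introverted", "extroverted"],
    "environment_pref": ["indoorsy", "outdoorsy"],
    "injury_status":    ["none", "knee", "back"],
    "schedule":         ["flexible", "busy"],
    # ... additional background attributes (occupation, hobbies, etc.)
}

def sample_user(rng: random.Random) -> dict:
    """Step 1: sample one value per attribute to create a simulated user."""
    return {name: rng.choice(values) for name, values in ATTRIBUTES.items()}

def ideal_strategy(user: dict) -> str:
    """Step 2: deterministic rule mapping attributes to an ideal strategy
    (a toy stand-in for the paper's logic rule)."""
    if user["injury_status"] != "none":
        return "low-impact physiotherapy routine"
    if user["environment_pref"] == "outdoorsy" and user["personality"] == "extroverted":
        return "team sport"
    if user["schedule"] == "busy":
        return "short home workouts"
    return "solo gym program"

rng = random.Random(0)
user = sample_user(rng)
print(user, "->", ideal_strategy(user))
```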