Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align with different users’ preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model. We present a novel framework, Preference Learning Using Summarization (PLUS), that learns text-based summaries of each user’s preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. We train the user-summarization model with reinforcement learning and update the reward model simultaneously, creating an online co-adaptation loop. We show that, in contrast with prior personalized RLHF techniques or in-context learning of user information, summaries produced by PLUS capture meaningful aspects of a user’s preferences. Across different pluralistic user datasets, we show that our method is robust to new users and diverse conversation topics. Additionally, we demonstrate that the textual summaries generated about users can be transferred for zero-shot personalization of stronger, proprietary models like GPT-4. The resulting user summaries are not only concise and portable but also easy for users to interpret and modify, allowing for greater transparency and user control in LLM alignment.
One way to achieve pluralistic alignment is to condition the reward model on information about the user. Reward modeling is a key component of many existing algorithms for LLM alignment (e.g., RLHF, Best-of-N sampling [19, 11]). Therefore, training the reward model to capture the diversity of user preferences can lead to better generation and selection of the LLM’s responses. Rather than learning a single reward model on the entire user dataset, as is typically done under the standard BTL model, we can learn a user-conditioned reward model that better captures the diversity of human preferences. Mathematically, this means the reward model is conditioned on a latent variable that captures variability across users. Prior work [10] has considered providing the reward model with the user’s self-stated preference context. However, this information may not always be available, as users may not know how best to communicate their preferences to LLMs. Poddar et al. [13] instead represent this latent variable with an embedding vector learned from the user’s past preference data. However, it can be challenging to train a large language model to compress text into a single embedding vector without losing performance [4].
A more natural approach, one that retains LLMs’ strength in reasoning over text, is to learn text-based user summaries that act as the latent user variable conditioning the reward model. While it might be possible to simply prompt an LLM reward model with an automatically generated user summary, our experiments reveal that such zero-shot summaries tend to focus on the topic of the conversation and omit the key details the reward model needs to accurately determine the user’s unique preferences and guide future conversations.
We propose to learn how to summarize important information about the user using reinforcement learning (RL) fine-tuning. We use RL to train an LLM to summarize each user’s past conversations, using the predictive accuracy of the summary-conditioned reward model as the training signal for determining which aspects of the past conversations give meaningful information about the user’s preferences. Our algorithm consists of two components: a summarizer trained with Proximal Policy Optimization (PPO) [15] and a summary-conditioned reward model, which provides the reward signal for training the summarizer. The key challenge is simultaneously updating both the summarizer and the reward model so that the summary can be optimized based on the reward model’s prediction accuracy, and the reward model can be improved based on the generated summary of the user’s preferences. This allows the reward model to leverage concise summaries about the user, rather than the long and noisy dialogue history, while providing the summarizer with an appropriate signal for deciding which aspects of past conversations to focus on to guide future interactions.
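The following Python-style sketch illustrates one step of this co-adaptation loop; the component interfaces (summarizer, reward_model, ppo_trainer, and the batch fields) are illustrative placeholders rather than an exact implementation:

```python
import torch
import torch.nn.functional as F

def plus_training_step(batch, summarizer, reward_model, rm_optimizer, ppo_trainer):
    """One co-adaptation step. `summarizer` is the LLM being fine-tuned with PPO,
    `reward_model` is the summary-conditioned RM, and `ppo_trainer` is a generic
    PPO trainer; all interfaces here are placeholders."""
    # 1. Summarize each user's past conversations into a textual summary z.
    summaries = summarizer.generate(batch["user_context"])

    # 2. Score the chosen and rejected responses, conditioned on the summary.
    r_chosen = reward_model(summaries, batch["prompt"], batch["chosen"])
    r_rejected = reward_model(summaries, batch["prompt"], batch["rejected"])

    # 3. Update the reward model with the Bradley-Terry-Luce log-likelihood loss.
    rm_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    rm_optimizer.zero_grad()
    rm_loss.backward()
    rm_optimizer.step()

    # 4. The summarizer's reward is the RM's prediction accuracy on the pair:
    #    the summary is rewarded when it lets the RM rank the responses correctly.
    summary_reward = (r_chosen > r_rejected).float().detach()

    # 5. PPO update of the summarizer with this reward signal.
    ppo_trainer.step(queries=batch["user_context"],
                     responses=summaries,
                     rewards=summary_reward)
    return rm_loss.item()
```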
• A novel pluralistic alignment algorithm, PLUS (Preference Learning Using Summarization), that uses RL to jointly learn user summaries and train a reward model conditioned on them. The summaries and reward model co-evolve: the summarizer is trained to generate representations that are more informative for modeling the user’s preferences, while the reward model simultaneously adapts to leverage the evolving summaries. This allows the model to capture the most meaningful dimensions of the user’s preferences from flexible text input, including past conversations, the user’s self-stated preferences, and survey responses.
Reinforcement Learning from Human Feedback (RLHF). Our work builds on reinforcement learning from human feedback (RLHF) [12], which models binary human preferences using the Bradley-Terry-Luce (BTL) model [3]. Under the BTL model, the user indicates which of two responses they prefer, and a reward model (RM) estimating the strength of the user’s preferences is fit to this data. The RM is then used to train an LLM policy to optimize responses based on the reward signal. RLHF has been effective in refining language models’ responses to better align with user-preferred attributes, like helpfulness and truthfulness. While most existing works [1, 12, 22, 16] (including Direct Preference Optimization [14], which avoids learning a reward model and instead directly learns an optimal policy under the assumed reward model) have assumed a single reward model to capture all user data, we are interested in personalized RLHF via a user-conditional reward model, similar to Poddar et al. [13].
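Concretely, given a prompt $x$ and candidate responses $y_A$ and $y_B$, the BTL model posits $P(y_A \succ y_B \mid x) = \sigma\big(r_\theta(x, y_A) - r_\theta(x, y_B)\big)$, where $\sigma$ is the logistic function, and the reward model $r_\theta$ is fit by minimizing the negative log-likelihood $-\log \sigma\big(r_\theta(x, y_A) - r_\theta(x, y_B)\big)$ over pairs in which $y_A$ is the chosen response. The user-conditioned variant we pursue replaces $r_\theta(x, y)$ with $r_\theta(x, y, z)$, where $z$ is a representation of the user (this notation follows the standard RLHF formulation and is included here only for exposition).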
Pluralistic Alignment of LLMs. Despite the impressive quality of large language models’ outputs, several works [2, 8, 9, 13, 17] have pointed out that user data reflects diverse underlying preferences, which can conflict in terms of what individual users consider ideal responses. These differences may surface along various dimensions, including value alignment and writing style. For example, some users might prefer honest AI systems that calibrate their responses with uncertainty, rather than always sounding confident. In contrast, other users might prefer systems that try to be as informative as possible, even if some of the information requires additional fact-checking by the user. Fitting a single reward model fails to account for these differences, especially when they lead to meaningful semantic changes in the responses. In particular, PRISM [9] presents a heterogeneous dataset of 1,500 participants from 75 countries interacting with 21 different LLMs, where users vary in how much they value model attributes such as language fluency, factuality, creativity, and helpfulness.
We evaluate our approach using this dataset, as well as other settings explored in prior work [13, 5].

Personalized RLHF. Recent techniques have been developed for personalized RLHF. For example, Poddar et al. [13] embed a user’s past preference logs into a user-specific vector representation and learn a reward model conditioned on this embedding. However, training a large language model to compress text into a single embedding vector without losing performance can be challenging [4]. Similar to [13], [10] also learns embeddings to represent each user, but decomposes the user model into two parts: one conditioned on the user’s index (with no textual information about the user) and the other conditioned on the user’s textual information. However, their method does not extend to settings where only the user’s past conversation history or preference logs are available, as it assumes that textual information about the user, such as demographic data or self-stated preference descriptions, is provided. In cases where users do not explicitly state their preferences, we want to infer their preferences from past interaction data and represent the learned preferences as a string that users can modify.
Wu et al. [20] fine-tunes LLMs using reinforcement learning to achieve personalization for user recommendations. Specifically, they generate user-specific summaries from historical product data, such as movie and book ratings. These summaries are evaluated based on how accurately a prediction model can forecast the user’s future ratings, conditioned on the generated summaries; this predictive accuracy also serves as a training signal for the summarization model. However, their approach assumes a static prediction model, whereas our method jointly learns both the summarizer and the reward model.
Our key idea is to capture user preference variability through a textual representation that can condition the reward model, enabling personalization. This requires simultaneously learning (1) a reward model conditioned on user summaries, and (2) a summarizer that transforms information about the user into a textual representation that provides useful signal for reward learning in step (1).

How to construct user context? Prior work [13] mainly focuses on one particular type of user context information: the user’s preference labels over past conversation history. We assume that the user context $c$ takes the form $c = \{(s^1_A, s^1_B), (s^2_A, s^2_B), \ldots, (s^m_A, s^m_B)\}$, where for each $i$, $(s^i_A, s^i_B)$ is a preference pair (without loss of generality, $s^i_A$ is the chosen and $s^i_B$ the rejected response). These pairs can be collected from different conversations or different turns within the same conversation. While we focus on the same input type, PLUS can take in any flexible textual input beyond past preference labels. In our experiments, we explore this flexibility by augmenting the user context with additional textual information, such as relevant attribute names and the user’s self-stated preferences, when applicable.
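As a concrete illustration of how such a context can be assembled into text for the summarizer, consider the following sketch; the function name, field labels, and template are hypothetical, not the exact format used in our experiments:

```python
def build_user_context(preference_pairs, extra_info=None):
    """Assemble the user context c into a single text block.
    preference_pairs: list of (chosen, rejected) response strings drawn from
    the user's past conversations; extra_info: optional free-form text such as
    candidate attribute names or the user's self-stated preferences."""
    lines = []
    for i, (chosen, rejected) in enumerate(preference_pairs, start=1):
        lines.append(f"Pair {i}:\nChosen: {chosen}\nRejected: {rejected}")
    if extra_info:
        lines.append(f"Additional user information: {extra_info}")
    return "\n\n".join(lines)
```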
Why learn summaries instead of directly using the user context? The most straightforward ideas include using the user context directly as in-context learning examples, or using a pretrained LLM to generate zero-shot summaries of the user’s conversation history. Indeed, these are the baselines we compare against in our experiments. However, past conversation examples can hurt the reward model’s generalization when the new conversations cover different topics, and zero-shot summaries tend to focus on specific conversation topics rather than on aspects that are relevant for predicting the user’s future preferences.
In many cases, $z$ is not readily available. Alternatively, we could ask the user to craft a system message like “I value conciseness and factuality over fluency”, which could serve as the $z$ to condition the reward model. But when such preferences are not explicitly stated, where can we find the signal for inferring the user-specific $z$? Given a long user history, how does the AI assistant know which parts of the past conversation to attend to when personalizing its response for future conversations? Our key insight is that the summarizer can learn to extract relevant information from past conversations and generate $z$ to guide future responses.
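For concreteness, one simple way to condition the reward model on such a textual $z$ (whether user-written or produced by the summarizer) is to prepend it to the reward model’s input; the template below is a hypothetical illustration, not the exact format we use:

```python
def reward_model_input(z, prompt, response):
    """Format the RM input so that its score depends on the user representation z."""
    return (f"[User preferences]\n{z}\n\n"
            f"[Prompt]\n{prompt}\n\n"
            f"[Response]\n{response}")
```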
A system designer may know which attribute dimensions could be relevant for personalization, but not which ones actually matter to a specific user. To simulate this scenario in the UltraFeedback dataset, we include the list of attributes (helpfulness, honesty, truthfulness, and instruction-following) in the prompt and ask the summarizer to identify which attribute is most valuable to this user, without revealing the user’s true preferences.
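To make this concrete, a simplified version of such a summarizer prompt might look as follows; the exact wording used in our experiments may differ:

```python
# Hypothetical summarizer prompt for the UltraFeedback setting.
SUMMARIZER_PROMPT = (
    "Below are responses this user chose and rejected in past conversations.\n"
    "{user_context}\n\n"
    "Candidate attributes: helpfulness, honesty, truthfulness, instruction-following.\n"
    "Based only on the examples above, state which of these attributes this user "
    "appears to value most, and explain why."
)
```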
6.2 Does RL fine-tuning improve LLM-generated user summaries for preference learning?

PLUS with the untrained summarizer sometimes fails to identify the relevant dimensions of the user’s preferences and instead focuses on the conversation topics. Specifically, on Pets, we observe that the untrained summarizer outputs “The user seems to appreciate traits that denote affectionate and playful qualities in pets”, which applies equally to both pets, or it incorrectly identifies the user’s preference as valuing “short, factual information”. In contrast, the trained summarizer can clearly state the user’s preference between cats and dogs (e.g., “interested in information about cat behavior and properties, excluding topics related to dogs”), allowing the summaries to generalize to future statements about previously unseen traits.
Limitations. Accurately modeling real human data in PRISM remains a challenge: the small dataset size, relative to the high variability in user preferences and conversation topics, makes pluralistic reward learning very difficult, even with users’ self-stated preference data.