Language Model Personalization via Reward Factorization
Modern large language models (LLMs) are optimized for human-aligned responses using Reinforcement Learning from Human Feedback (RLHF). However, existing RLHF approaches assume a universal preference model and fail to account for individual user preferences, limiting their effectiveness in personalized applications. We introduce PReF, a framework that extends RLHF to enable user personalization by leveraging the assumption that user preferences lie in a low-dimensional space. Instead of training a separate model per user, we represent user-specific rewards as a linear combination of shared base reward functions. Using only 10 user responses, our method can infer user-specific rewards and align LLM outputs accordingly.
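To make the factorization concrete, the sketch below shows how a user-specific reward could be assembled from base rewards and per-user coefficients. It is a minimal illustration in our own notation (the function and variable names are not from the paper): the user's reward for a prompt-response pair is the dot product between the user's coefficient vector and the vector of base reward values.

```python
# Minimal sketch of the reward factorization (illustrative names, not the paper's API):
#     r_u(x, y) = sum_k  user_weights[k] * base_rewards[k](x, y)
import numpy as np

def user_reward(prompt: str, response: str, base_rewards, user_weights: np.ndarray) -> float:
    """Score a (prompt, response) pair for one user.

    base_rewards : list of K callables, each mapping (prompt, response) -> float
    user_weights : shape (K,) coefficients inferred for this user
    """
    features = np.array([phi(prompt, response) for phi in base_rewards])  # (K,) base reward values
    return float(user_weights @ features)
```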
In PReF, we first collect preference data in which each pair of responses to the same prompt is annotated with the user’s preference (i.e., which response is preferred) and the user’s identity. We use this dataset to learn the base reward functions. Once the base reward functions are determined, the next step is to infer the coefficients for each new user. To achieve this, we present the user with a sequence of questions, each accompanied by a pair of candidate responses, and ask them to indicate which response they prefer. Based on these answers, we estimate the user’s coefficients and, thus, their specific reward function.
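A plausible way to fit the base rewards and per-user coefficients jointly is a Bradley-Terry style objective on the annotated pairs, where the probability that a user prefers one response over the other is a logistic function of the reward margin. The PyTorch sketch below is our hedged illustration of such a training step; the module and batch-field names are placeholders rather than the paper's actual API.

```python
import torch.nn.functional as F

def preference_loss(base_reward_model, user_embeddings, batch):
    """One Bradley-Terry style training step over a batch of annotated preference pairs.

    base_reward_model : maps (prompts, responses) to a (B, K) tensor of base rewards
    user_embeddings   : nn.Embedding giving each user a K-dimensional coefficient vector
    batch             : dict with prompts, both responses, user ids, and 0/1 labels
    """
    phi_a = base_reward_model(batch["prompt"], batch["response_a"])  # (B, K) base rewards for response a
    phi_b = base_reward_model(batch["prompt"], batch["response_b"])  # (B, K) base rewards for response b
    w_u = user_embeddings(batch["user_id"])                          # (B, K) per-user coefficients
    margin = (w_u * (phi_a - phi_b)).sum(dim=-1)                     # (B,) user-specific reward margins
    # label = 1 when response_a was preferred by this user, 0 otherwise
    return F.binary_cross_entropy_with_logits(margin, batch["label"].float())
```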
The challenge is to estimate the user’s coefficients with as few questions as possible. To this end, we adopt an active learning approach in which the sequence of questions adapts to the user: each question is selected based on the user’s prior answers so as to efficiently refine their preference model. Specifically, we select the question and pair of responses that most reduce the uncertainty of the coefficient estimate. We adapt and extend results from the logistic bandit literature to compute uncertainty scores for candidate response pairs efficiently. Using our method, we can determine the user’s coefficients with only 10-20 questions.
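As a rough illustration of this selection rule, the sketch below scores each candidate pair by the predictive variance of its base-reward difference vector under a Gaussian approximation of the current coefficient estimate, and applies a logistic-regression-style Fisher information update after each answer. This is our simplification of the idea, not the exact criterion from the paper.

```python
# Illustrative uncertainty-driven query selection (our simplification, not the paper's
# exact criterion). We maintain a Gaussian approximation N(w_hat, cov) over the user's
# coefficients and query the pair whose reward-difference direction is most uncertain.
import numpy as np

def select_next_pair(candidate_diffs: np.ndarray, cov: np.ndarray) -> int:
    """candidate_diffs: (N, K) rows of phi(x, y_a) - phi(x, y_b) for candidate pairs.
    cov: (K, K) covariance of the current coefficient estimate.
    Returns the index of the pair with the largest predictive variance."""
    scores = np.einsum("nk,kl,nl->n", candidate_diffs, cov, candidate_diffs)
    return int(np.argmax(scores))

def update_covariance(cov: np.ndarray, diff: np.ndarray, p_hat: float) -> np.ndarray:
    """After observing the user's answer, add the logistic Fisher information
    p_hat * (1 - p_hat) * diff diff^T of the queried pair to the precision matrix."""
    precision = np.linalg.inv(cov) + p_hat * (1.0 - p_hat) * np.outer(diff, diff)
    return np.linalg.inv(precision)
```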
Once the user-specific reward function has been identified, the next step is to align the LLM’s outputs to it. We leverage recent advances in inference-time alignment methods (Han et al., 2024; Yang et al., 2024b; Rame et al., 2024), which generate reward-aligned responses at deployment without modifying the LLM’s weights. This allows for efficient, scalable adaptation to individual users without requiring costly model updates.
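For intuition, one simple instance of inference-time alignment is best-of-N reranking: sample several candidate responses from the frozen LLM and return the one scored highest by the personalized reward. The cited methods are more sophisticated; the sketch below (with placeholder sample_fn and reward_fn) only illustrates the general idea of steering outputs without touching the model's weights.

```python
# Minimal best-of-N sketch of inference-time alignment (one simple strategy, shown for
# intuition only). sample_fn and reward_fn are placeholders for the frozen LLM sampler
# and the personalized reward built from the base rewards and the user's coefficients.
def personalized_generate(prompt, sample_fn, reward_fn, n_candidates=16):
    """Draw candidates from the unmodified LLM and keep the one the user's reward prefers."""
    candidates = [sample_fn(prompt) for _ in range(n_candidates)]
    return max(candidates, key=lambda response: reward_fn(prompt, response))
```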