Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Paper · arXiv 2408.10075 · Published August 19, 2024
Reinforcement Learning · Reward Models · Assistants · Personalization

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population. When these differences arise, traditional RLHF frameworks simply average over them, leading to inaccurate rewards and poor performance for individual subgroups. To address the need for pluralistic alignment, we develop a class of multimodal RLHF methods. Our proposed techniques are based on a latent variable formulation: inferring a novel user-specific latent and learning reward models and policies conditioned on this latent without additional user-specific data. While conceptually simple, we show that in practice this reward modeling requires careful algorithmic considerations around model architecture and reward scaling. To empirically validate our proposed technique, we first show that it provides a way to combat underspecification in simulated control problems, inferring and optimizing user-specific reward functions. Next, we conduct experiments on pluralistic language datasets representing diverse user preferences and demonstrate improved reward function accuracy.

While the Bradley-Terry-Luce (BTL) model accounts for noisy preferences, RLHF typically applies it under the ‘unimodal’ assumption that all human preferences derive from a single utility function. This fails to capture scenarios where preferences diverge, i.e., are multi-modal, due to fundamentally different utilities. For example, Figure 1 shows a case where one group of users prefers detailed responses, while another prefers concise ones. By performing maximum likelihood estimation under the unimodal BTL model, current methods learn a reward function that averages these multi-modal preferences (akin to mode averaging in imitation learning [51]). As we show in our experimental results, this model misspecification leads to reward models that are inaccurate, and the policies optimized on these rewards fail to accomplish tasks per any of the distinct preferences (see Figures 8 and 3).
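
For concreteness, the standard unimodal BTL objective takes the following form, with $\sigma$ the logistic function, $r_\theta$ a single reward model shared across all annotators, and $\mathcal{D}$ the pooled preference dataset (generic notation, not necessarily the paper's):

```latex
% Bradley-Terry-Luce likelihood that segment A is preferred over segment B
% under a single shared reward function r_theta (the unimodal assumption):
P(A \succ B) \;=\; \frac{\exp\big(r_\theta(A)\big)}{\exp\big(r_\theta(A)\big) + \exp\big(r_\theta(B)\big)}
            \;=\; \sigma\big(r_\theta(A) - r_\theta(B)\big)

% Maximum likelihood estimation over the pooled preference dataset D:
\theta^\star \;=\; \arg\max_\theta \sum_{(A, B, y) \in \mathcal{D}} \Big[
    y \log \sigma\big(r_\theta(A) - r_\theta(B)\big)
    \;+\; (1 - y) \log \sigma\big(r_\theta(B) - r_\theta(A)\big) \Big]
```

When $\mathcal{D}$ pools labels from users with conflicting utilities, no single $r_\theta$ can satisfy them all, and the maximizer averages across the groups, which is the mode-averaging failure described above.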

Building on techniques from variational inference [11, 38], we propose a method—Variational Preference Learning (VPL)—for multi-modal reward modeling. Intuitively, given a few preference annotations from a particular user, our approach uses a variational encoder to infer a latent distribution over hidden user context, and a latent conditional reward model to accurately recover the true multi-modal preference distribution. We derive an evidence lower bound (ELBO) for latent-variable preference-based reward optimization. Our proposed algorithm, VPL, effectively learns a distribution of reward functions from large corpora of preferences provided by diverse users.
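
Schematically, the resulting objective has the familiar variational form below, where $\mathcal{D}_u$ denotes the small set of annotated pairs observed from user $u$, $q_\phi$ the variational encoder, and $r_\theta(\cdot, z)$ the latent-conditioned reward; this is a simplified rendering of the bound rather than the paper's exact derivation:

```latex
% Latent-conditioned BTL likelihood for a single comparison, given user latent z:
p_\theta(A \succ B \mid z) \;=\; \sigma\big(r_\theta(A, z) - r_\theta(B, z)\big)

% Schematic evidence lower bound: reconstruct a user's preference labels y_i
% under the latent-conditioned reward, while keeping the encoder posterior
% q_phi(z | D_u) close to the prior p(z).
\log p_\theta\big(\{y_i\} \mid \{(A_i, B_i)\}, \mathcal{D}_u\big)
\;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid \mathcal{D}_u)}
    \Big[ \sum_i \log p_\theta\big(y_i \mid A_i, B_i, z\big) \Big]
\;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid \mathcal{D}_u) \,\|\, p(z)\big)
```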

In developing practical training methods for such latent-conditioned reward models, we show that several complexities and technical considerations arise. A key problem is that binary comparisons inherently lack information about the scale of rewards. Under the BTL model (and correspondingly the VPL model), the preference label between two alternatives A and B only provides information about the difference $r_A - r_B$. As a result, the learned reward function may have vastly different scales across individual users, which adversely affects the optimization landscape of multi-user reinforcement learning [31, 69] on these rewards. To mitigate this, we show how a simple pairwise classification scheme [61, 49] can appropriately bound and scale reward estimates within our latent variable framework, thereby improving the performance of downstream learned policies. The predicted latent distribution enables several additional capabilities.
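
As one concrete illustration of the fix, the sketch below bounds per-user rewards by passing the latent-conditioned reward head through a sigmoid and training with a binary cross-entropy preference loss; this is a minimal stand-in for the pairwise classification scheme of [61, 49] rather than its exact implementation, and all module and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentConditionedReward(nn.Module):
    """Reward model r(s, z) conditioned on a user latent z.

    The final sigmoid bounds rewards to (0, 1), so every user's reward
    function shares the same scale regardless of how their preferences
    were expressed (a simple fix for the r_A - r_B scale ambiguity).
    """

    def __init__(self, obs_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        logits = self.net(torch.cat([obs, z], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)  # rewards bounded in (0, 1)


def preference_loss(model, obs_a, obs_b, z, labels):
    """Binary cross-entropy on pairwise labels (1 if A is preferred, else 0).

    Because rewards are already bounded, the implied logit r(A, z) - r(B, z)
    stays in (-1, 1) for every user, keeping reward scales comparable when
    a multi-user policy is later optimized against them.
    """
    r_a = model(obs_a, z)
    r_b = model(obs_b, z)
    return F.binary_cross_entropy_with_logits(r_a - r_b, labels.float())


if __name__ == "__main__":
    # Toy usage: 8 preference pairs annotated by one (already encoded) user.
    obs_dim, latent_dim, batch = 6, 4, 8
    model = LatentConditionedReward(obs_dim, latent_dim)
    obs_a, obs_b = torch.randn(batch, obs_dim), torch.randn(batch, obs_dim)
    z = torch.randn(batch, latent_dim)        # stand-in for the encoder output
    labels = torch.randint(0, 2, (batch,))
    loss = preference_loss(model, obs_a, obs_b, z, labels)
    loss.backward()
    print(float(loss))
```

Because every user's reward lies in (0, 1), downstream policy optimization sees comparable reward magnitudes across users, regardless of how strongly any individual annotator expressed their preferences.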