MaxMin-RLHF: Alignment with Diverse Human Preferences

Paper · arXiv 2402.08925 · Published February 14, 2024

Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a single reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. To provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a MaxMin alignment objective for policy learning inspired by the Egalitarian principle in social choice theory to better represent diverse human preferences. We elucidate the connection of our proposed approach to distributionally robust optimization and general utility RL, thereby highlighting the generality and robustness of our proposed solution. We present comprehensive experimental results on small-scale (GPT-2) and large-scale (Tulu2-7B) language models and show the efficacy of the proposed approach in the presence of diversity among human preferences.
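
To make the idea of learning a mixture of preference distributions via expectation-maximization more concrete, here is a minimal sketch of a hard-assignment EM loop on toy data. It is an illustration under stated assumptions, not the paper's implementation: the one-dimensional linear reward (one weight per cluster), the Bradley-Terry likelihood, the function name hard_em, the initial weights, and the toy annotators are all hypothetical.

```python
import numpy as np


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -np.logaddexp(0.0, -z)


def hard_em(d, labels, w_init, n_iters=10, lr=0.1, grad_steps=20):
    """Cluster annotators by preference with hard-assignment EM (toy sketch).

    d      : (pairs,) feature difference f(y1) - f(y2) for each response pair
    labels : (users, pairs) 1.0 if the user preferred y1, else 0.0
    w_init : one initial reward weight per cluster (assumed linear reward)
    """
    w = np.array(w_init, dtype=float)
    signs = 2.0 * labels - 1.0  # +1 if y1 was preferred, -1 otherwise
    for _ in range(n_iters):
        # E-step: assign each user to the cluster that best explains their labels.
        z = signs[:, None, :] * (w[None, :, None] * d[None, None, :])
        loglik = log_sigmoid(z).sum(axis=2)  # shape (users, clusters)
        assign = loglik.argmax(axis=1)
        # M-step: refit each cluster's weight by gradient ascent on its members.
        for k in range(len(w)):
            members = labels[assign == k]
            if len(members) == 0:
                continue  # leave an empty cluster unchanged
            for _ in range(grad_steps):
                p = 1.0 / (1.0 + np.exp(-w[k] * d))
                grad = ((members - p) * d).sum()
                w[k] += lr * grad / members.size
    return w, assign


# Toy data: three annotators prefer the higher-feature response, two the lower.
d = np.array([1.0, -2.0, 0.5, -1.5])
prefers_high = (d > 0).astype(float)
prefers_low = 1.0 - prefers_high
labels = np.stack([prefers_high] * 3 + [prefers_low] * 2)

w, assign = hard_em(d, labels, w_init=[0.5, -0.1])
```

On this toy data the two annotator groups end up in different clusters, and the two clusters learn reward weights of opposite sign. A soft-assignment variant would weight each user's contribution by posterior cluster probabilities instead of taking the argmax.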

Introduction. The alignment problem, central to developing and fine-tuning current large language models (LLMs), represents a crucial challenge in artificial intelligence, especially in ensuring these models operate in harmony with human values, preferences and social welfare (Wang et al., 2023; Christian, 2020). Reinforcement learning from human feedback (RLHF) has emerged as a pivotal approach to alignment problems, specifically for aligning LLMs (Wang et al., 2023; Ouyang et al., 2022b; Stiennon et al., 2022a; Ouyang et al., 2022a). RLHF operates in three steps: (1) supervised fine-tuning, (2) reward learning, and (3) RL fine-tuning. Step 2 learns a reward function that is expected to represent the preference feedback of the human population. However, there has been minimal emphasis on accurately representing the diversity of human preferences and the broad spectrum of user populations. As highlighted by Aroyo & Welty (2015); Aroyo et al. (2023a,b), “the notion of ‘one truth’ in crowdsourcing responses is a myth” and we need to account for the diversity in opinions and preferences.
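
As a small illustration of why a single reward learned in step 2 can miss diversity, the sketch below assumes the reward is fit with a Bradley-Terry model, a common choice in RLHF that this excerpt does not itself specify. The 80/20 population split and the function names are hypothetical. For one response pair, the reward gap that minimizes the expected Bradley-Terry loss is the logit of the pooled preference rate, so the fitted reward follows the majority.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -(math.log1p(math.exp(-abs(z))) + max(-z, 0.0))


def bt_nll(margin, p_first):
    """Expected Bradley-Terry negative log-likelihood for one response pair.

    margin  : reward(first) - reward(second)
    p_first : fraction of pooled annotators who prefer the first response
    """
    return -(p_first * log_sigmoid(margin) + (1.0 - p_first) * log_sigmoid(-margin))


def best_single_margin(p_first):
    # Closed-form minimizer of bt_nll over margin: the logit of p_first.
    return math.log(p_first / (1.0 - p_first))


# Hypothetical population: an 80% majority always prefers the first response,
# a 20% minority always prefers the second.
pooled = 0.8 * 1.0 + 0.2 * 0.0
margin = best_single_margin(pooled)
```

Here the fitted margin is positive, meaning the single reward ranks the majority's choice higher on this pair even though a fifth of the annotators hold the opposite preference.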

Discussion / Conclusion. In this work, we critically examine the limitations of the conventional single-reward Reinforcement Learning from Human Feedback (RLHF) framework, particularly its insufficiency in addressing the diversity of human preferences, leading to an impossibility result for alignment with diverse preferences. To achieve a socially fair alignment in diverse human preference settings, we introduce a novel MaxMin-RLHF approach, which learns a max-min policy over a mixture of reward functions to achieve a more equitable model alignment. Our experiments demonstrate the effectiveness of MaxMin-RLHF in producing socially fairer outcomes, highlighting the need for more inclusive strategies in RLHF methodologies.
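
The max-min idea can be illustrated with a toy selection among a few candidate policies. This is only a sketch under assumptions: the utility table, the group weights, and the function names are hypothetical, and it omits any regularization toward a reference policy that RLHF objectives typically include. It contrasts a population-weighted average objective with an egalitarian objective that maximizes the utility of the worst-off group.

```python
def average_choice(utilities, weights):
    """Index of the candidate with the highest population-weighted utility."""
    def weighted(i):
        return sum(w * u for w, u in zip(weights, utilities[i]))
    return max(range(len(utilities)), key=weighted)


def maxmin_choice(utilities):
    """Index of the candidate whose worst-off group utility is largest."""
    return max(range(len(utilities)), key=lambda i: min(utilities[i]))


# Rows are candidate policies; columns are expected utilities for
# (majority group, minority group). All values are made up for illustration.
utilities = [
    [0.9, 0.1],  # strong for the majority, poor for the minority
    [0.6, 0.6],  # balanced across both groups
    [0.2, 0.8],  # strong for the minority, poor for the majority
]
weights = [0.8, 0.2]  # hypothetical population shares

avg_pick = average_choice(utilities, weights)
fair_pick = maxmin_choice(utilities)
```

With these numbers the weighted average favors the first candidate, which serves the majority well and the minority poorly, while the max-min rule picks the balanced second candidate.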
