Reward-Robust RLHF in LLMs

Paper · arXiv 2409.15360 · Published September 18, 2024
Reward Models

As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI). However, the reliance on reward-model-based (RM-based) alignment methods introduces significant challenges due to the inherent instability and imperfections of Reward Models (RMs), which can lead to critical issues such as reward hacking and misalignment with human intentions. In this paper, we introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges, paving the way for more reliable and resilient learning in LLMs. Our approach formulates a novel optimization objective that carefully balances performance and robustness by incorporating Bayesian Reward Model Ensembles (BRME) to model the uncertainty set of reward functions. This allows the framework to integrate both nominal performance and minimum reward signals, ensuring more stable learning even with imperfect reward models. Empirical results demonstrate that our framework consistently outperforms traditional RLHF across diverse benchmarks.

The standard RLHF framework comprises two principal phases: 1) Reward Model (RM) training from human/AI feedback: an RM is trained on preference data, typically via Maximum Likelihood Estimation (MLE). 2) RM-based reinforcement learning: the policy model is then refined with the established deep reinforcement learning algorithm, Proximal Policy Optimization (PPO; Schulman et al., 2017), based on the reward function determined in the preceding phase.
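To make phase 1 concrete, below is a minimal sketch of scalar RM training with the standard Bradley-Terry style MLE objective on preference pairs (the PPO phase is omitted). The `RewardModel` class, the Hugging Face-style `last_hidden_state` backbone interface, and the `chosen`/`rejected` batch fields are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scalar reward head on top of a language-model backbone."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone  # assumed to return per-token hidden states
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Summarize each sequence by the hidden state of its last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        last = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last).squeeze(-1)  # one scalar reward per sequence

def preference_mle_loss(rm, chosen, rejected):
    """Bradley-Terry MLE: maximize log sigmoid(r_chosen - r_rejected)."""
    r_chosen = rm(chosen["input_ids"], chosen["attention_mask"])
    r_rejected = rm(rejected["input_ids"], rejected["attention_mask"])
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```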

Several issues arise from an imperfect RM: 1) Reward hacking: the model exploits loopholes in the reward function, optimizing for unintended behaviors that maximize the reward signal without genuinely improving task performance. 2) Overfitting and underfitting: an overfitted RM captures noise or idiosyncratic patterns in the training data that do not generalize to new data, while an underfitted RM fails to capture important patterns altogether; either leads to poor decision-making.

Here, J_perform(θ) measures nominal performance, based on signals from a nominal reward model (the reward model used as an approximation of the golden one, disregarding uncertainty), while J_robust(θ) assesses worst-case performance across all reward functions within an uncertainty set. We propose Bayesian Reward Model Ensembles (BRME) to characterize both the nominal reward function and the uncertainty set: we train a multi-head RM in which each head outputs the mean and standard deviation of a Gaussian distribution, from which the final reward is sampled. BRME is trained with a Mean Square Error (MSE) loss. BRME has two main advantages over traditional RMs, which are trained with an MLE loss and output a scalar reward: 1) We prove that the standard deviation reflects a head's confidence in its output reward, so the head with the lowest standard deviation can reasonably be chosen to provide the nominal reward. 2) We show that both the coverage of the reward distribution and the accuracy on the preference test set are superior to those of traditional RMs.
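The following is a hedged sketch of the BRME idea as described above: a multi-head RM whose heads each predict the mean and standard deviation of a Gaussian reward, trained with an MSE-style loss, with the lowest-standard-deviation head supplying the nominal reward and the minimum over heads supplying the robust (worst-case) signal. The class names, the regression target, the blending weight `lam`, and the use of the final token position are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class BRMEHead(nn.Module):
    """One ensemble member: predicts (mean, std) of a Gaussian reward distribution."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mean = nn.Linear(hidden_size, 1)
        self.log_std = nn.Linear(hidden_size, 1)

    def forward(self, h):
        return self.mean(h).squeeze(-1), self.log_std(h).squeeze(-1).exp()

class BRME(nn.Module):
    """Bayesian Reward Model Ensemble: shared backbone features, K Gaussian heads."""
    def __init__(self, backbone: nn.Module, hidden_size: int, num_heads: int = 5):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList(BRMEHead(hidden_size) for _ in range(num_heads))

    def forward(self, input_ids, attention_mask):
        # Simplified: use the hidden state at the final position as the sequence summary.
        h = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state[:, -1]
        means, stds = zip(*(head(h) for head in self.heads))
        return torch.stack(means, dim=-1), torch.stack(stds, dim=-1)  # each [B, K]

def brme_mse_loss(means, stds, target_reward):
    """MSE-style regression: each head's sampled reward is pulled toward a target score."""
    # Reparameterized sample so gradients flow through both mean and std.
    sampled = means + stds * torch.randn_like(stds)
    return ((sampled - target_reward.unsqueeze(-1)) ** 2).mean()

def combined_reward(means, stds, lam: float = 0.5):
    """Blend nominal (most confident head) and robust (minimum over heads) signals."""
    nominal = means.gather(-1, stds.argmin(dim=-1, keepdim=True)).squeeze(-1)
    worst_case = means.min(dim=-1).values
    return (1.0 - lam) * nominal + lam * worst_case
```

At RL time, `combined_reward` mirrors the spirit of trading off J_perform (nominal signal) against J_robust (worst-case signal); the exact weighting used in the paper may differ from this simple convex combination.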