RLHF Workflow: From Reward Modeling to Online RLHF

Paper · arXiv 2405.07863 · Published May 13, 2024
Reward Models · Reinforcement Learning

In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), which is widely reported in the recent large language model (LLM) literature to outperform its offline counterpart by a large margin. However, existing open-source RLHF projects are still largely confined to the offline learning setting. We aim to fill this gap and provide a detailed, easy-to-reproduce recipe for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models from a diverse set of open-source datasets and use the resulting proxy preference model to approximate human feedback. We then discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available.

1.1 Previous RLHF Algorithms and Their Challenges

Generally, previous RLHF methods can be largely divided into two categories: (1) deep RL-based approach using Proximal Policy Optimization (PPO) (Schulman et al., 2017; Christiano et al., 2017; Ziegler et al., 2019) and (2) (offline) direct preference learning (e.g., DPO) approaches (Zhao et al., 2023; Rafailov et al., 2023; Azar et al., 2023; Tang et al., 2024).

DRL-based framework. The DRL-based framework consists of two stages. In the first stage, a reward model is trained on preference data, typically as the maximum likelihood estimator (MLE) of the Bradley-Terry (BT) model. In the second stage, the LLM policy is optimized against this reward with PPO, usually with a KL penalty toward the reference model to keep the policy from drifting too far. However, even in classic deep RL benchmarks, PPO is known to be less stable than supervised learning and sensitive to implementation details and hyper-parameter choices.
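
For reference, below is a minimal sketch of the first-stage objective, i.e., the negative log-likelihood of the Bradley-Terry model on a batch of preference pairs, assuming the reward model has already produced scalar scores for the chosen and rejected responses (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def bt_reward_loss(chosen_rewards: torch.Tensor,
                   rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry MLE objective for reward modeling:
    -log sigmoid(r(x, y_chosen) - r(x, y_rejected)), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```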

This becomes even more challenging in the context of LLMs, as fine-tuning LLMs is computationally expensive and searching the complicated hyper-parameter configuration space is generally infeasible. Additionally, the PPO algorithm requires loading multiple LLMs simultaneously, including the actor (policy), critic (value network), reward model, and reference model (for KL estimation), which places significant pressure on GPU memory, especially for resource-constrained open-source projects.
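
To make this concrete, the sketch below shows the kind of KL-regularized reward signal the PPO stage typically optimizes: the actor and the frozen reference model are both needed just to compute it, and the reward model and critic come on top. Function and tensor names, the KL coefficient, and the simple per-token KL estimate are illustrative assumptions rather than the paper's exact recipe:

```python
import torch

def kl_penalized_rewards(policy_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         reward_score: torch.Tensor,
                         kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token reward signal optimized by PPO in the DRL-based framework.

    policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of the generated
        tokens under the actor and the frozen reference model (padding ignored
        here for simplicity).
    reward_score: (batch,) scalar reward-model score for each full response.
    """
    # Penalize deviation from the reference model at every token
    # (a simple per-token estimate of the KL divergence).
    rewards = -kl_coef * (policy_logprobs - ref_logprobs)
    # Add the scalar reward-model score at the final token of each response.
    bonus = torch.zeros_like(rewards)
    bonus[:, -1] = reward_score
    return rewards + bonus
```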

Direct preference learning. In view of the above issues of PPO, there is an innovative line of work that directly learns from human preference datasets without explicitly constructing a reward function (Zhao et al., 2023; Rafailov et al., 2023; Azar et al., 2023). Among these methods, the direct preference optimization (DPO) algorithm is particularly popular.
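
For reference, below is a minimal sketch of the DPO objective on a batch of preference pairs, assuming the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model have already been computed (names and the beta value are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: each input is the summed log-probability of a full
    response under the policy or the frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (implicit reward of chosen - implicit reward of rejected))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Because the reward is implicit in the policy-to-reference log-ratio, only the policy and the reference model need to be loaded, which is the main practical appeal over PPO.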

In other words, we take the best response and the worst response, as ranked by the reward model, to form a preference pair. In this case, we jointly optimize the two policies to maximize their difference (measured by the uncertainty), which tends to be more efficient in practice and enjoys the same theoretical guarantee as stated in Theorem 1. This choice is similar to Hoang Tran (2024); Pace et al. (2024); Yuan et al. (2024b); Xu et al. (2024). We also drop any pair where π_t^1 and π_t^2 give the same response, which implies that the uncertainty in this direction is already small.

For this round of experiments, we still use the reward function trained as the MLE of the BT reward model to rank the responses, for the following reasons. First, to rank n responses, the complexity of using the reward model is linear in n, whereas it is far more complicated with the pairwise preference model. Second, during early experiments we observed significant length bias in the iterative RLHF. We therefore want to explore strategies to mitigate this length bias, and it is relatively easy to penalize the reward value with the length of the response. Finally, the BT reward model is comparable to the preference model except on the reasoning task, and it may already be satisfactory for our goal. We leave a more comprehensive comparison between the BT reward model and the preference model for future study.
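
The sketch below illustrates this pair construction with a simple length-penalized reward; the function name, the penalty coefficient, the use of character length rather than token count, and the best-of-n / worst-of-n selection over a flat list of sampled responses are illustrative assumptions:

```python
from typing import List, Optional, Tuple

def build_preference_pair(responses: List[str],
                          rewards: List[float],
                          length_penalty: float = 0.001) -> Optional[Tuple[str, str]]:
    """Form one preference pair from n sampled responses: the best- and
    worst-scoring responses after a length penalty, or None if they coincide."""
    # Penalize the raw reward-model score by response length to mitigate length bias.
    scores = [r - length_penalty * len(resp) for resp, r in zip(responses, rewards)]
    best = max(range(len(responses)), key=scores.__getitem__)
    worst = min(range(len(responses)), key=scores.__getitem__)
    if responses[best] == responses[worst]:
        return None  # drop pairs where the two selected responses are identical
    return responses[best], responses[worst]
```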