Direct Language Model Alignment from Online AI Feedback
Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF) that do not require a separate reward model. However, the preference datasets used in DAP methods are usually collected ahead of training and never updated, so the feedback is purely offline. Moreover, responses in these datasets are often sampled from a language model distinct from the one being aligned, and since the model evolves over training, the alignment phase is inevitably off-policy. In this study, we posit that online feedback is key and improves DAP methods. Our method, online AI feedback (OAIF), uses an LLM as annotator: on each training iteration, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback. Despite its simplicity, we demonstrate via human evaluation on several tasks that OAIF outperforms both offline DAP and RLHF methods. We further show that the feedback leveraged in OAIF is easily controllable via instruction prompts to the LLM annotator.
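The per-iteration procedure described above (sample two responses from the current model, ask the LLM annotator which is preferred, then update) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function `oaif_step` and the callables `sample`, `annotate` and `update` are hypothetical names introduced here, standing in for the current policy's sampler, the LLM annotator, and one gradient step of any DAP loss.

```python
from typing import Callable, Tuple

# Hypothetical interfaces, assumed for illustration only.
Sampler = Callable[[str], str]              # prompt -> one response from the current policy
Annotator = Callable[[str, str, str], int]  # (prompt, resp_a, resp_b) -> 0 if resp_a preferred, else 1
Updater = Callable[[str, str, str], None]   # (prompt, chosen, rejected) -> one DAP update


def oaif_step(
    prompt: str,
    sample: Sampler,
    annotate: Annotator,
    update: Updater,
) -> Tuple[str, str]:
    """Run one online-feedback iteration and return (chosen, rejected)."""
    # Both responses come from the current policy, so the data is on-policy.
    y1 = sample(prompt)
    y2 = sample(prompt)

    # The annotator labels the pair during training, so the feedback is online.
    preferred = annotate(prompt, y1, y2)
    chosen, rejected = (y1, y2) if preferred == 0 else (y2, y1)

    # Apply one update of the chosen DAP loss (e.g. DPO, IPO or SLiC).
    update(prompt, chosen, rejected)
    return chosen, rejected
```

Passing the annotator as a plain callable reflects the controllability point: under this sketch, changing the instruction prompt given to the annotator only changes `annotate`, while the sampling and update steps stay the same.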
Introduction. To maximise the benefits of large language models (LLMs) to society, it is important to align them with human expectations and values (Ouyang et al., 2022; Bai et al., 2022a; Bubeck et al., 2023). The first method introduced for alignment was reinforcement learning from human feedback (RLHF, Christiano et al., 2017; Stiennon et al., 2020), which trains a reward model (RM) from pairwise preferences and then optimises a policy against the RM via reinforcement learning (RL). More recently, direct alignment from preferences (DAP) methods, such as direct preference optimisation (DPO, Rafailov et al., 2023), sequence likelihood calibration with human feedback (SLiC, Zhao et al., 2023), and identity policy optimisation (IPO, Azar et al., 2023), have emerged as popular alternatives to RLHF. In contrast to RLHF, the DAP methods directly update the language model (a.k.a. policy) πθ using pairwise preference data, making the alignment simpler, more efficient and more stable (Rafailov et al., 2023).
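To make "directly update the policy using pairwise preference data" concrete, the DPO loss of Rafailov et al. (2023) for a single preference pair can be written as a function of four sequence log-probabilities. The sketch below is a scalar illustration under stated assumptions: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are chosen here and are not taken from this paper, and a real implementation would operate on batched tensors with automatic differentiation.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; beta scales the log-ratio margin
) -> float:
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # Margin between the policy-vs-reference log-ratios of the two responses.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

No reward model appears anywhere in this computation: the loss depends only on log-probabilities under the policy and a frozen reference model, which is what distinguishes DAP methods from the two-stage RLHF pipeline.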
Discussion / Conclusion. To circumvent the offline feedback problem in direct alignment from preferences (DAP) methods, such as DPO, we proposed Online AI Feedback (OAIF), a simple and effective way to make DAP methods online via AI feedback. We carried out an extensive empirical evaluation, using both AI and human evaluation, which showed the effectiveness of DAP methods combined with OAIF over their offline counterparts. We also exhibited the tendency of offline DAP methods to overfit and, in contrast, the usefulness of OAIF as a way to mitigate reward overoptimisation. We further verified the generality of OAIF, as our empirical results hold for three prominent DAP methods: DPO, IPO and SLiC. Beyond the empirical evaluation of OAIF, our work also contributes a comparison of two types of methods: online DAP methods (e.g., online DPO) and RLAIF. Since the feedback comes from identical models in both learning algorithms, our experimental setup ensures that the AI feedback is of the same quality and that only the learning procedures differ.