RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
One obstacle to employing RLHF at scale is its dependence on high-quality human preference labels. Modern large language models (LLMs) have shown a high degree of alignment with human judgment (Gilardi et al., 2023; Ding et al., 2023), suggesting that LLM-generated preference labels may be a viable substitute for human labels.
We find that soliciting chain-of-thought reasoning (Wei et al., 2022) consistently improves the alignment of AI labels with human preferences, while a detailed preamble and few-shot prompting (Brown et al., 2020) are beneficial only for certain tasks.
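To make the labeling setup concrete, the sketch below shows how a chain-of-thought rationale might be solicited from an off-the-shelf LLM before it states a preference between two candidate summaries. The `llm` callable, the prompt template, and the string-matching parse are illustrative assumptions rather than the paper's exact implementation.

```python
from typing import Callable

# Illustrative preamble describing the labeling task (an assumption,
# not the paper's exact wording).
PREAMBLE = (
    "A good summary is a shorter piece of text that captures the essence "
    "of the original. Given a text and two candidate summaries, decide "
    "which summary is better."
)

# Chain-of-thought prompt: ask for a rationale before the final answer.
COT_TEMPLATE = """{preamble}

Text: {text}
Summary 1: {summary_1}
Summary 2: {summary_2}

First, explain step by step which summary is better, then end with a line
that says either "Preferred: Summary 1" or "Preferred: Summary 2".

Rationale:"""


def ai_preference_label(
    llm: Callable[[str], str],
    text: str,
    summary_1: str,
    summary_2: str,
) -> int:
    """Return 1 or 2 depending on which summary the LLM prefers."""
    prompt = COT_TEMPLATE.format(
        preamble=PREAMBLE,
        text=text,
        summary_1=summary_1,
        summary_2=summary_2,
    )
    completion = llm(prompt)
    # Parse the preference from the final line of the completion.
    lines = completion.strip().splitlines() or [""]
    return 2 if "Summary 2" in lines[-1] else 1


# Example with a stubbed LLM that always prefers the second candidate.
fake_llm = lambda prompt: "Summary 2 covers the key points.\nPreferred: Summary 2"
print(ai_preference_label(fake_llm, "a long post ...", "draft A", "draft B"))  # -> 2
```

A more careful parse would compare the log-probabilities of the two answer tokens rather than string-matching the completion, which would also yield a soft preference label instead of a hard choice.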