RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
One obstacle to employing RLHF at scale is its dependence on high-quality human preference labels. Modern large language models (LLMs) have shown a high degree of alignment with human judgment (Gilardi et al., 2023; Ding et al., 2023), suggesting that LLM-generated preference labels may be a viable substitute for human labels.
We find that soliciting chain-of-thought reasoning (Wei et al., 2022) consistently improves the alignment of AI labels with human preferences, while a detailed preamble and few-shot prompting (Brown et al., 2020) are beneficial only for certain tasks.
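To make the labeling setup concrete, the sketch below shows how a chain-of-thought rationale might be solicited from an off-the-shelf LLM before it states a preference between two candidate summaries. The `llm` callable, the prompt template, and the string-matching parse are illustrative assumptions rather than the paper's exact implementation.

```python
from typing import Callable

# Illustrative preamble describing the labeling task (an assumption,
# not the paper's exact wording).
PREAMBLE = (
    "A good summary is a shorter piece of text that captures the essence "
    "of the original. Given a text and two candidate summaries, decide "
    "which summary is better."
)

# Chain-of-thought prompt: ask for a rationale before the final answer.
COT_TEMPLATE = """{preamble}

Text: {text}
Summary 1: {summary_1}
Summary 2: {summary_2}

First, explain step by step which summary is better, then end with a line
that says either "Preferred: Summary 1" or "Preferred: Summary 2".

Rationale:"""


def ai_preference_label(
    llm: Callable[[str], str],
    text: str,
    summary_1: str,
    summary_2: str,
) -> int:
    """Return 1 or 2 depending on which summary the LLM prefers."""
    prompt = COT_TEMPLATE.format(
        preamble=PREAMBLE,
        text=text,
        summary_1=summary_1,
        summary_2=summary_2,
    )
    completion = llm(prompt)
    # Parse the preference from the final line of the completion.
    lines = completion.strip().splitlines() or [""]
    return 2 if "Summary 2" in lines[-1] else 1


# Example with a stubbed LLM that always prefers the second candidate.
fake_llm = lambda prompt: "Summary 2 covers the key points.\nPreferred: Summary 2"
print(ai_preference_label(fake_llm, "a long post ...", "draft A", "draft B"))  # -> 2
```

A more careful parse would compare the log-probabilities of the two answer tokens rather than string-matching the completion, which would also yield a soft preference label instead of a hard choice.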