Simple Synthetic Data Reduces Sycophancy In Large Language Models

Paper · arXiv 2308.03958 · Published August 7, 2023

“Language models have seen significant advancement in recent years, including the capacity to solve complex tasks that require reasoning (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023; Google, 2023; Touvron et al., 2023, inter alia). As these models may one day be able to solve problems that humans cannot solve, it is important to ensure that models are aligned and avoid reward hacking (Amodei et al., 2016; Saunders et al., 2022; Bowman et al., 2022), such as exploiting the preferences of human raters (Amodei et al., 2016; Cotra, 2021). One basic form of reward hacking is sycophancy, where a model responds to a question with a user’s preferred answer in order to look favorable even if that answer is not correct (Cotra, 2021; Perez et al., 2022; Radhakrishnan et al., 2023), as shown in Figure 1.

We extend these sycophancy evaluations by creating a similar task using simple addition statements that are clearly incorrect. We demonstrate that when the user does not give any opinion, the model knows that these statements are wrong and correctly disagrees with them. When the user instead reveals that they agree with these same statements, however, we find that language models will flip their response and agree with the incorrect statement despite knowing that the statement is incorrect.”