Language Models Learn to Mislead Humans via RLHF

Paper · arXiv 2409.12822 · Published September 19, 2024

Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it “U-SOPHISTRY” since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans’ accuracy against gold labels. On a question-answering task (QuALITY) and programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects’ false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g. backdoored LMs), does not generalize to U-SOPHISTRY. Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them1.

Language models (LMs) are used for more complex tasks as they become more capable. This poses an increasing challenge for human evaluators to catch subtle errors in LM outputs that look correct at a glance.

U-SOPHISTRY poses significant risks when we use LMs for complex and critical tasks. For instance, RLHF might make AI better at persuading humans to accept inaccurate scientific findings or biased policies on high-stakes issues (Hendrycks et al., 2023). This is ironic: while RLHF is supposed to control AI, it might deceive humans into believing that they are in control (Christiano, 2019).

While likely in theory (Skalse et al., 2022), U-SOPHISTRY is yet to be empirically validated. Many prior works study I-SOPHISTRY: while they aim to study unintended misleading AI behaviors, they induce these behaviors Intentionally with non-standard engineering practices and hope their conclusions can generalize to U-SOPHISTRY. For example, Sharma et al. (2023) explicitly prompts LMs to deceive human subjects, Hubinger et al. (2024) fine-tunes LMs on malicious behaviors, and Denison et al. (2024) uses brittle rewards designed to be hacked.2 In contrast, we study USOPHISTRY that naturally emerges from standard, innocuous practices: we need to know whether USOPHISTRY matters in practice, how LMs can mislead humans, and what mitigations are effective. We empirically investigate U-SOPHISTRY in two tasks: long-passage question-answering and algorithmic programming. We ask time-constrained (e.g. 3-10 minutes) human subjects to evaluate the correctness of LM’s outputs. We then measure U-SOPHISTRY by calculating human evaluation accuracy against gold labels before and after RLHF.

With 150 hours of human study, we find that U-SOPHISTRY emerges even under widely-accepted reward signals, e.g. optimizing against a reward model learned from the ChatbotArena human preference data (Chiang et al., 2024a). We find that after RLHF, the LM does not get better at the task, but it misleads our subjects to approve its incorrect answers more often. Our subjects become worse at evaluating LM’s outputs: their false positive rate increases by 24% on question-answering (QuALITY) (Pang et al., 2022) and 18% on programming (APPS) (Hendrycks et al., 2021). Our subjects are also misled to confidently mislabel incorrect outputs as correct.

We qualitatively analyze how LMs mislead our subjects after RLHF by surveying their feedback. On question-answering, LMs learn to defend incorrect answers by cherry-picking or fabricating supporting evidence, making consistent but untruthful arguments, and providing arguments that contain subtle causal fallacies. On the programming task, LMs learn to generate partially incorrect programs that still pass all evaluator-designed unit tests, produce less readable programs, and make fewer common errors that humans typically check for.