Suppressing Pink Elephants with Direct Principle Feedback

Paper · arXiv 2402.07896 · Published February 12, 2024

Existing methods for controlling language models, such as RLHF and Constitutional AI, involve determining which LLM behaviors are desirable and training them into a language model. However, in many cases, it is desirable for LLMs to be controllable at inference time, so that they can be used in multiple contexts with diverse needs. We illustrate this with the Pink Elephant Problem: instructing an LLM to avoid discussing a certain entity (a “Pink Elephant”), and instead discuss a preferred entity (“Grey Elephant”). We apply a novel simplification of Constitutional AI, Direct Principle Feedback, which skips the ranking of responses and uses DPO directly on critiques and revisions.
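As a rough illustration of the Direct Principle Feedback idea described in the abstract, the sketch below (hypothetical `generate`, `critique`, and `revise` helpers standing in for calls to a chat model; not the paper's actual code) turns each critiqued response and its revision directly into a DPO preference pair, with no ranking step:

```python
# Minimal sketch of Direct Principle Feedback (DPF) pair construction.
# `generate`, `critique`, and `revise` are hypothetical wrappers around a chat
# model; the paper's actual prompts and infrastructure may differ.

def build_dpf_pairs(prompts, generate, critique, revise):
    """Turn (original response, revision) pairs into DPO training examples.

    Unlike full RLAIF, no ranking over multiple candidates is performed:
    the revised response is taken as "chosen" and the original as "rejected".
    """
    pairs = []
    for prompt in prompts:
        original = generate(prompt)                    # initial (possibly undesirable) response
        feedback = critique(prompt, original)          # principle-based critique of that response
        revision = revise(prompt, original, feedback)  # response rewritten to follow the principle
        pairs.append({
            "prompt": prompt,
            "chosen": revision,    # preferred: the revised response
            "rejected": original,  # dispreferred: the pre-revision response
        })
    return pairs
```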

Prior work has shown that techniques like Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022b) can not only improve a model’s ability to remain harmless and helpful (Tunstall et al., 2023a; Ivison et al., 2023) but can also improve a model’s ability to reason (Shao et al., 2024; Luo et al., 2023a; Lightman et al., 2023) and reduce its tendency to hallucinate (Tian et al., 2023). However, results have shown that even with finetuning and prompting, making a model “not” discuss a topic remains a difficult and open problem (McKenzie et al., 2023; García-Ferrero et al., 2023).

Reinforcement Learning from AI Feedback, as originally presented in Bai et al. (2022b), uses a four-step process depicted in fig. 2.

  1. Finetune a model on examples of helpful requests and outputs (blue).

  2. Critique and revise those outputs to be more desirable, and fine-tune a new model on those outputs (orange).

  3. Use your Supervised Fine-tuning (SFT) model to generate responses to a prompt and have a human or AI system rank those responses (green).

  4. Feed the ranked responses into a preference learning algorithm such as PPO or DPO to produce the final model (purple); a code sketch of the full pipeline is given below.
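Read as code, the four steps above amount to a simple training loop. The sketch below is illustrative only; `sft_train`, `critique_revise`, `rank`, and `preference_train` are hypothetical stand-ins for real SFT, ranking, and preference-optimization code:

```python
# High-level sketch of the four-step RLAIF pipeline described above.
# All helper functions are hypothetical stand-ins for real training code.

def rlaif_pipeline(base_model, helpful_examples, prompts,
                   sft_train, critique_revise, rank, preference_train):
    # Step 1: fine-tune on examples of helpful requests and outputs.
    helpful_model = sft_train(base_model, helpful_examples)

    # Step 2: critique and revise the model's outputs, then fine-tune on the revisions.
    revised_examples = [critique_revise(helpful_model, p) for p in prompts]
    sft_model = sft_train(helpful_model, revised_examples)

    # Step 3: sample several responses per prompt and rank them
    # (by a human or an AI judge).
    ranked = [rank(sft_model.generate(p, num_samples=4)) for p in prompts]

    # Step 4: run a preference-learning algorithm (PPO or DPO) on the rankings.
    final_model = preference_train(sft_model, ranked)
    return final_model
```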

Previous work has sought to simplify this pipeline by excluding the Critique and Revision step and doing pairwise ranking of generations from the initial model directly (Tunstall et al., 2023a; Zhu et al., 2023). This has the advantage of requiring only a single preference collection and fine-tuning step, and is effective for improving chat quality or teaching models to behave as assistants.
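For reference, the DPO objective that this single preference step optimizes can be written directly in terms of chosen/rejected sequence log-probabilities. A minimal PyTorch sketch, assuming those log-probabilities have already been computed under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities for the
    chosen/rejected responses under the policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```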

We curated a dataset of 162K multi-turn conversations on the Pink Elephant Problem. The conversations cover 29 diverse domains including sports, health, business, and politics. The dataset took approximately 2,000 A100 hours to produce, with roughly 20,000-30,000 A100 hours spent on prototyping.