Unintended Impacts of LLM Alignment on Global Representation

Paper · arXiv 2402.15018 · Published February 22, 2024
Personalized Assistants

Before deploying Large Language Models (LLMs) in user-facing applications, developers align them to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We also find that alignment improves capabilities in several languages. We conclude by discussing the design decisions that led to these unintended impacts and offering recommendations for more equitable preference tuning.

Introduction. Recently, LLM-powered chat assistants (OpenAI, 2023a; Touvron et al., 2023; Tunstall et al., 2023b) have exploded in popularity. As of December 2023, ChatGPT has amassed over 100M weekly users (OpenAI, 2023b) and Llama-Chat-7B is downloaded almost one million times a month from HuggingFace. The success of these chat models depends on "alignment", which takes a base model trained with a language modeling objective and produces an instruction-following, preference-guided model that better serves user interests. Practitioners use algorithms such as RLHF (Ouyang et al., 2022) and DPO (Rafailov et al., 2023) to optimize models for attributes such as helpfulness and harmlessness and to give them their chat assistant persona (Ouyang et al., 2022; Bai et al., 2022; Zhu et al., 2023). Unlike the nebulous pre-training process, which is largely defined by the distribution of data online (Raffel et al., 2019; Gao et al., 2020; Computer, 2023), alignment gives model developers a high degree of control over the key variables. Who will give feedback? What prompts/tasks are in-domain?

Discussion / Conclusion. Our findings underscore three key recommendations for practitioners aligning LLMs.

The Alignment of Language Models is not a One-Size-Fits-All Solution. Various groups are impacted differently by the alignment procedure.

Transparency is of the utmost importance in disclosing the design decisions that go into aligning an LLM. Each step of alignment adds complexity and further affects end users. As such, transparent reporting (Mitchell et al., 2019; Bommasani et al., 2023; Longpre et al., 2023; Liesenfeld et al., 2023; Gilbert et al., 2023) should ideally encompass the entire alignment pipeline, not just the final model. The InstructGPT paper (Ouyang et al., 2022) reports the demographics of its preference annotators, but most human-written preference datasets since then have not. Reporting such information, along with decisions about which prompts or tasks are in-domain, is essential for the responsible dissemination of aligned LLMs to a diverse audience of users (Sorensen et al., 2024).

Slightly Multilingual SFT Data can have an Outsized Impact.