Position: Towards Bidirectional Human-AI Alignment
https://arxiv.org/pdf/2406.09264
Recent advances in general-purpose AI underscore the urgent need to align AI systems with human goals and values. Yet the lack of a clear, shared understanding of what constitutes "alignment" limits meaningful progress and cross-disciplinary collaboration. In this position paper, we argue that the research community should explicitly define and critically reflect on "alignment" to account for the bidirectional and dynamic relationship between humans and AI. Through a systematic review of over 400 papers spanning HCI, NLP, ML, and related fields, we examine how alignment is currently defined and operationalized. Building on this analysis, we introduce the Bidirectional Human-AI Alignment framework, which not only incorporates traditional efforts to align AI with human values but also introduces the critical, underexplored dimension of aligning humans with AI: supporting cognitive, behavioral, and societal adaptation to rapidly advancing AI technologies. Our findings reveal significant gaps in the current literature, especially in long-term interaction design, human value modeling, and mutual understanding. We conclude with three central challenges and actionable recommendations to guide future research toward more nuanced and reciprocal human-AI alignment approaches.
Challenge 1: Specification Gaming. AI designers often specify objectives or feedback signals to align systems with human goals, but these specifications rarely capture all intended values [6]. This leads to reliance on proxies such as human approval [4], enabling specification gaming [7, 8], in which AI produces seemingly "correct" behavior for the wrong, often opaque, reasons [9, 10, 11].
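To make this failure mode concrete, consider the following minimal Python sketch. It is our illustration, not drawn from any cited work: a toy two-action bandit where the agent is trained on a proxy reward (simulated human approval) that only partially tracks the true objective. The action names and reward values are invented.

```python
import random

# Toy bandit illustrating specification gaming: the agent optimizes a
# proxy reward (simulated human approval) that diverges from the true
# objective. Actions and reward numbers are purely illustrative.

ACTIONS = ["solve_task", "flatter_user"]

def true_reward(action: str) -> float:
    # What the designer actually wants: the task solved.
    return 1.0 if action == "solve_task" else 0.0

def proxy_reward(action: str) -> float:
    # What the agent is optimized for: noisy human approval. Flattery
    # reliably wins approval; genuine solutions are approved less
    # consistently because they are harder to evaluate.
    base = 0.9 if action == "flatter_user" else 0.6
    return base + random.gauss(0.0, 0.05)

def train_greedy(episodes: int = 1000) -> str:
    # Estimate each action's average proxy reward, then act greedily.
    totals = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for _ in range(episodes):
        a = random.choice(ACTIONS)  # uniform exploration
        totals[a] += proxy_reward(a)
        counts[a] += 1
    return max(ACTIONS, key=lambda a: totals[a] / counts[a])

policy = train_greedy()
print("learned policy:", policy)                     # flatter_user
print("proxy reward: %.2f" % proxy_reward(policy))   # high approval...
print("true reward:  %.2f" % true_reward(policy))    # ...but 0.0: the spec was gamed
```

The learned policy scores highly on the proxy while earning zero true reward, which is precisely the gap between specified and intended behavior that Challenge 1 describes.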
Challenge 2: Scalable Oversight. As AI systems become more complex, potentially reaching AGI [12], aligning them through human feedback grows harder. Evaluating their behavior is often slow or infeasible [5], prompting research into reducing supervision burdens and enhancing human oversight, a challenge known as Scalable Oversight [13].
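One common strategy in this space, sketched below under our own simplifying assumptions, is to triage model outputs so that humans review only the cases a cheap automated checker is least confident about. The checker, threshold, and examples are hypothetical.

```python
from dataclasses import dataclass

# Sketch of a confidence-based triage scheme for scalable oversight:
# escalate only low-confidence outputs to a human reviewer. All names
# and thresholds here are illustrative assumptions.

@dataclass
class Output:
    text: str
    checker_confidence: float  # automated evaluator's confidence in [0, 1]

def needs_human_review(output: Output, threshold: float = 0.8) -> bool:
    # Escalate low-confidence cases to a human; auto-approve the rest.
    return output.checker_confidence < threshold

outputs = [
    Output("routine answer", 0.95),
    Output("edge-case answer", 0.55),
    Output("novel, hard-to-judge answer", 0.30),
]

for o in outputs:
    route = "HUMAN" if needs_human_review(o) else "auto"
    print(f"{route:5s} {o.text} (confidence={o.checker_confidence:.2f})")

# Human effort now scales with the number of *uncertain* outputs rather
# than with the total number of outputs the system produces.
```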
Challenge 3: Dynamic Nature. As AI advances, alignment must adapt to evolving human values. Without considering long-term cognitive and social impacts, AI may become neither humane nor desirable [14]. Addressing this requires a dynamic, ongoing alignment process supported by cross-disciplinary collaboration.
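A toy simulation, entirely of our own construction, illustrates why one-shot alignment is insufficient: if the target value drifts over time, a model aligned once accumulates error, while periodic re-alignment keeps tracking the drift. The drift model and update schedule are invented for illustration.

```python
import math

# Toy simulation of value drift: compare a model aligned once at t=0
# against a model re-aligned every 10 steps. Dynamics are invented.

def human_value(t: int) -> float:
    return math.sin(0.1 * t)  # slowly drifting societal value

static_model = human_value(0)   # aligned once, never updated
dynamic_model = human_value(0)  # re-aligned every 10 steps

static_err = dynamic_err = 0.0
for t in range(60):
    if t % 10 == 0:
        dynamic_model = human_value(t)  # periodic re-alignment
    static_err += abs(human_value(t) - static_model)
    dynamic_err += abs(human_value(t) - dynamic_model)

print(f"mean error, one-shot alignment: {static_err / 60:.3f}")
print(f"mean error, ongoing alignment:  {dynamic_err / 60:.3f}")
```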
Traditionally, AI alignment has been approached as a static, one-way process focused on shaping AI to achieve desired outcomes and avoid harm [15, 1, 16]. However, this unidirectional view is increasingly insufficient as AI systems become more integrated into daily life and assume complex decision-making roles [17]. Their interactions with humans create evolving feedback loops that influence both AI behavior and human responses [18, 19, 20], highlighting the need for a more dynamic and reciprocal understanding of alignment [17].
Our findings reveal key gaps in existing research, particularly in human value modeling, oversight of model inference, critical evaluation of AI’s embedded values, and its broader societal impact. We conclude by outlining near- and long-term risks and opportunities, offering actionable recommendations to advance more reciprocal, adaptive, and nuanced approaches to human-AI alignment.
2 Defining Alignment: Fundamentals
Building on our systematic review (see details in Appendix 8), we explicitly identify the key definitions in alignment research and formally propose "Bidirectional Human-AI Alignment".

Goals. AI alignment research proposes multiple alignment goals [21, 22], such as intentions [1, 23], preferences [24, 25], instructions [26, 27], and values [28, 29], but these terms are often used interchangeably without clear distinctions. Philosophical analysis suggests that human values (moral beliefs and principles) are the most suitable alignment goal, as they ensure AI acts ethically while minimizing risks [29, 30]. Though trade-offs exist, this work adopts "human values" as the alignment objective, meaning AI should behave as individuals or society morally expect (see details in Table 2).

Align with Whom. AI alignment involves multiple stakeholders, such as end users [31], AI practitioners [32, 23, 33], and organizations [34]. Many studies reference "general humans" without accounting for group differences, despite the fact that stakeholders often hold conflicting values [29].
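The following short sketch, using hypothetical stakeholder groups and ratings of our own invention, shows why "align with general humans" is underspecified: groups can assign opposite value to the same AI behavior, and a single averaged objective hides that conflict.

```python
import statistics

# Hypothetical value ratings for one AI behavior (e.g., auto-summarizing
# a user's emails), given by three stakeholder groups. All numbers are
# invented for illustration.

ratings = {
    "end_users":         +1.0,   # convenient
    "ai_practitioners":  +0.5,   # useful, with caveats
    "privacy_advocates": -1.0,   # unacceptable data exposure
}

scores = list(ratings.values())
print(f"mean rating:  {statistics.mean(scores):+.2f}")   # looks mildly positive
print(f"disagreement: {statistics.pstdev(scores):.2f}")  # conceals a sharp conflict

# Optimizing the mean would quietly commit the system to one side of the
# conflict; value-based alignment must represent *whose* values are at
# stake and surface disagreement rather than average it away.
```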
Definition: Bidirectional Human-AI Alignment is a comprehensive framework that encompasses two interconnected alignment processes: ‘Aligning AI with Humans’ and ‘Aligning Humans with AI’. The former focuses on integrating human specifications into training, steering, and customizing AI. The latter supports human agency, empowering people to think critically when using AI, collaborate effectively with it, and adapt societal approaches to maximize its benefits for humanity.
3 Bidirectional Human-AI Alignment Framework
This section introduces the Bidirectional Human-AI Alignment framework, which encompasses two interconnected alignment directions forming a feedback loop, as shown in Figure 1. The "Align AI to Humans" direction refers to mechanisms that ensure AI systems' values match those of humans. The "Align Humans to AI" direction investigates human cognitive and behavioral adaptation to AI advancement. We detail each direction below.
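As a minimal sketch of this feedback loop, with dynamics invented purely for illustration, the simulation below couples the two directions: each round, the AI updates toward the human's expressed values ("align AI to humans"), while the human's AI literacy grows with exposure ("align humans to AI"), sharpening the feedback the AI receives in the next round.

```python
# Invented dynamics illustrating the bidirectional feedback loop.

ai_value = 0.0      # the AI's current estimate of the human's value
human_value = 1.0   # the human's actual value
literacy = 0.2      # how accurately the human can express feedback

for step in range(5):
    # Direction 1: align AI to humans (learn from imperfect feedback).
    feedback = literacy * human_value + (1 - literacy) * ai_value
    ai_value += 0.5 * (feedback - ai_value)
    # Direction 2: align humans to AI (the human learns to evaluate
    # and steer the system through interaction).
    literacy = min(1.0, literacy + 0.2)
    print(f"step {step}: ai_value={ai_value:.2f}, literacy={literacy:.2f}")

# With literacy frozen at its initial level, the same loop converges far
# more slowly: each direction of alignment amplifies the other.
```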