Beyond Preferences in AI Alignment
The dominant practice of AI alignment assumes (1) that preferences are an adequate representation of human values, (2) that human rationality can be understood in terms of maximizing the satisfaction of preferences, and (3) that AI systems should be aligned with the preferences of one or more humans to ensure that they behave safely and in accordance with our values. Whether implicitly followed or explicitly endorsed, these commitments constitute what we term a preferentist approach to AI alignment. In this paper, we characterize and challenge the preferentist approach, describing conceptual and technical alternatives that are ripe for further research. We first survey the limits of rational choice theory as a descriptive model, explaining how preferences fail to capture the thick semantic content of human values, and how utility representations neglect the possible incommensurability of those values. We then critique the normativity of expected utility theory (EUT) for humans and AI, drawing upon arguments showing how rational agents need not comply with EUT, while highlighting how EUT is silent on which preferences are normatively acceptable. Finally, we argue that these limitations motivate a reframing of the targets of AI alignment: Instead of alignment with the preferences of a human user, developer, or humanity writ large, AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose assistant. Furthermore, these standards should be negotiated and agreed upon by all relevant stakeholders. On this alternative conception of alignment, a multiplicity of AI systems will be able to serve diverse ends, aligned with normative standards that promote mutual benefit and limit harm despite our plural and divergent values.
Borrowing a term from the philosophy of welfare (Baber, 2011), we identify these formulations as part of a broadly preferentist approach to AI alignment, which we characterize in terms of four theses about the role of preferences in both descriptive and normative accounts of (human-aligned) decision-making:
Rational Choice Theory as a Descriptive Framework. Human behavior and decision-making are well modeled as approximately maximizing the satisfaction of preferences, which can be represented as a utility or reward function (see the illustrative sketch following these four theses).
Expected Utility Theory as a Normative Standard. Rational agency can be characterized as the maximization of expected utility. Moreover, AI systems should be designed and analyzed according to this normative standard.
Single-Principal Alignment as Preference Matching. For an AI system to be aligned to a single human principal, it should act so as to maximize the satisfaction of the preferences of that human.
Multi-Principal Alignment as Preference Aggregation. For AI systems to be aligned to multiple human principals, they should act so as to maximize the satisfaction of their aggregate preferences.
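To make theses (1) and (2) concrete, here is a minimal sketch of the preferentist model of choice. It is not drawn from any cited work: the options, utilities, and the Boltzmann (softmax) noise model are illustrative assumptions, though the latter is a common way of operationalizing "approximate" maximization.

```python
import numpy as np

# Illustrative utilities over options (thesis 1: values as a utility function).
utilities = {"opera": 8.0, "cinema": 6.5, "stay_home": 3.0}

def boltzmann_choice_probs(utilities, beta=1.0):
    """Thesis (1) in its usual 'noisy' form: choice probabilities are a
    softmax over utilities, so behavior approximately maximizes them."""
    options = list(utilities)
    u = np.array([utilities[o] for o in options])
    p = np.exp(beta * u - np.max(beta * u))  # subtract max for stability
    return dict(zip(options, p / p.sum()))

def expected_utility(outcome_probs, outcome_utils):
    """Thesis (2): risky prospects are ranked by probability-weighted utility."""
    return sum(outcome_probs[o] * outcome_utils[o] for o in outcome_probs)

print(boltzmann_choice_probs(utilities, beta=2.0))
# A lottery is evaluated by its expected utility: 0.3 * 8.0 + 0.7 * 3.0 = 4.5
print(expected_utility({"opera": 0.3, "stay_home": 0.7}, utilities))
```

Much of what follows challenges the adequacy of exactly this picture, both as a description of human choice and as a normative standard.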
Of course, preferentism in AI alignment is not without its critics. There has been considerable discussion as to whether its component theses are warranted (Shah, 2018; Eckersley, 2018; Hadfield-Menell and Hadfield, 2018; Wentworth, 2019, 2023; Gabriel, 2020; Vamplew et al., 2022; Korinek and Balwit, 2022; Garrabrant, 2022; Thornley, 2023), echoing similar debates in economics, decision theory, and philosophy. Nonetheless, it is apparent that the dominant practice of AI alignment has yet to absorb the thrust of these debates.
In Section 4 we consider what this implies for aligning AI with a single human principal. Since reward functions may not capture even a single human’s values, the practice of reward learning is unsuitable beyond narrow tasks and contexts where people are willing to commensurate their values. Furthermore, since preferences are dynamic and contextual, they cannot serve as the alignment target for broadly-scoped AI systems. Rather, alignment with an individual person should be reconceived as alignment with the normative ideal of an assistant. More generally, AI systems should not be aligned with preferences, but with the normative standards appropriate to their social roles and functions (Kasirzadeh and Gabriel, 2023).
We argue that contractualist and agreement-based approaches can better handle value contestation while respecting the individuality of persons and the plurality of uses we have for AI.
Though originally designed for single-human contexts, RLHF is in practice almost always applied to preference datasets collected from multiple human labelers.
Consider a hypothetical example in the context of RLHF: Users are asked whether they would personally enjoy an LLM that can generate copyrighted short stories, and most of them say yes. If what we care about is aggregate (immediate) welfare, then uniform aggregation of the elicited preferences seems to achieve that goal. But if what we care about aggregating are all-things-considered value judgments, including legal and moral considerations, then uniform aggregation no longer seems so appropriate.
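As a rough sketch of how uniform aggregation enters the training signal, consider the Bradley-Terry preference model commonly used in RLHF reward modeling. The label counts below are hypothetical, matching the copyrighted-stories example above:

```python
import numpy as np

# Hypothetical pooled labels for one comparison: response A generates the
# copyrighted story, response B declines. Each annotator reports which
# output they personally prefer.
labels = ["A"] * 70 + ["B"] * 30  # 70% enjoy the copyrighted story

# Uniform aggregation: every label carries equal weight, so the learned
# reward tracks the majority's immediate enjoyment.
p_A = np.mean([label == "A" for label in labels])

# Under Bradley-Terry, p(A preferred) = sigmoid(r_A - r_B), so the implied
# reward gap is the log-odds of the pooled preference rate.
reward_gap = np.log(p_A / (1 - p_A))
print(f"p(A preferred) = {p_A:.2f}, implied r_A - r_B = {reward_gap:.2f}")

# Legal and moral considerations held by a minority of annotators, or by
# non-annotators such as the copyright holders, leave no trace in this signal.
```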
Similar issues arise when trying to aggregate toxicity or harmfulness judgments across multiple humans (Bai et al., 2022a; Davani et al., 2022). In these cases, the elicited preferences are goodness-of-a-kind judgments, and their connection to aggregate welfare (or all-things-considered goodness) is many steps removed. As such, uniform or majoritarian aggregation can easily fail to achieve social goals. If most human annotators are insensitive to certain forms of identity discrimination (e.g. sexually demeaning images, trans-exclusionary rhetoric, or antisemitic tropes), then AI systems trained on such data will almost certainly cause harm (Richardson et al., 2019; Okidegbe, 2021). Uniform preference aggregation may thus constitute a form of epistemic injustice (Fricker, 2007; Symons and Alvarado, 2022; Hull, 2023), which in turn leads to downstream injustice and harm.
Returning to the copyright example, we might want to grant veto power to copyright holders, allowing them to reasonably reject the welfare-oriented majority preference for copying their work. This veto right could be justified as an instantiation of Scanlon’s contractualism.
As for harmfulness judgments, it may often be preferable to apply prioritarian (Lumer et al., 2005; Holtug, 2017) or egalitarian (Rawls, 1971) approaches to aggregation. For example, one might select annotators who are most directly impacted by potential harms (Gordon et al., 2022), thereby prioritizing certain segments of the population. In cases of significant disagreement, one might even place all weight on the individual with the strongest dispreference (Leben, 2017; Bakker et al., 2022; Weidinger et al., 2023). Again, there are many possible justifications for such procedures. Prioritarian selection could be justified on normative grounds, or because of its epistemic benefits: after all, those most impacted by harms also tend to be more informed about their effects.
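A minimal sketch of how these aggregation rules diverge. The harm ratings are invented, and the concave prioritarian weighting shown is just one of many possible choices:

```python
import numpy as np

# Hypothetical harmfulness ratings (0 = harmless, 1 = severely harmful) for
# one model output; annotator 5 belongs to the group targeted by the content.
ratings = np.array([0.1, 0.2, 0.1, 0.2, 0.9])

def utilitarian(r):
    # Uniform aggregation: equal weight to every judgment.
    return r.mean()

def prioritarian(r, rho=-2.0):
    # Generalized mean of welfare (1 - harm) with rho < 1, which gives
    # extra weight to the worst-off judgments; rho -> -inf recovers maximin.
    w = 1.0 - r
    return 1.0 - np.mean(w ** rho) ** (1.0 / rho)

def egalitarian(r):
    # Maximin-style rule: all weight on the strongest dispreference.
    return r.max()

for rule in (utilitarian, prioritarian, egalitarian):
    print(rule.__name__, round(float(rule(ratings)), 3))
# Uniform averaging (0.3) dilutes the targeted annotator's judgment; the
# prioritarian (~0.78) and egalitarian (0.9) rules preserve it.
```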
Whatever procedure one favors, it is important not to confuse the aggregation rules used in AI systems with our ultimate social objectives. In practice, these aggregation rules are merely parts of the overall decision procedure implemented by (training) an AI system, and as many philosophers have pointed out, such procedures should be distinguished from standards of rightness.
Unfortunately, taking aggregate preferences as an alignment target immediately runs into theoretical difficulties. While these issues have been studied at length by social choice theorists, one that is especially challenging for standard utilitarian aggregation is incomparability. As we noted earlier, justifications for preference aggregation typically assume that each individual’s preferences can be represented as a utility function, and furthermore that utility can be compared across persons (Harsanyi, 1953, 1975). But as we elaborated in Section 2, these assumptions are very much in doubt. Even within a single individual, preferences may be incomplete due to incomparable choices, or not clearly comparable across time.
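The representational point can be made concrete. A utility function u with A preferred to B iff u(A) > u(B) exists only if the preference relation is complete (and transitive), so a relation that refuses some comparisons blocks utilitarian aggregation at the first step. The options and judgments below are invented for illustration:

```python
from itertools import combinations

# judgment[(x, y)] = the preferred option, or None if the agent declines
# to compare them (incomparability, as distinct from indifference).
options = ["career", "family", "art"]
judgment = {
    ("career", "family"): None,      # refuses to trade these off
    ("career", "art"): "career",
    ("family", "art"): "family",
}

def completeness_check(options, judgment):
    """A necessary condition for utility representation: every pair of
    options must be compared (indifference counts; incomparability does not)."""
    for x, y in combinations(options, 2):
        if judgment.get((x, y), judgment.get((y, x))) is None:
            return False, f"incomplete: {x} vs {y} are incomparable"
    return True, "complete (transitivity would be checked next)"

print(completeness_check(options, judgment))
# -> (False, 'incomplete: career vs family are incomparable'); no utility
# function represents this relation, even before comparing across persons.
```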
The political infeasibility of impartially benevolent AI. Perhaps even more importantly, the project of building AI that optimizes humanity’s aggregate preferences is politically infeasible: Even if impartially benevolent AI planners were possible to develop, building such systems would be incompatible with the incentives of every AI developer with a realistic chance of doing so. This is the case even for AI developers with expressly pro-social missions, which are still subject to market incentives as a result of the need to raise capital (Toner and McCauley, 2024), and are still governed by the laws and regulations of the countries they are based in. Allowing the creation of such AI systems would also risk the centralization of immense power.
Confusion about what reward functions represent. Alongside these limitations in expressiveness, there is often slippage among AI researchers regarding the ontological status of reward, which is sometimes interpreted as the intrinsic desirability of a particular state or action (Schroeder, 2004), or as a biological signal that promotes learning (Butlin, 2021) or evolutionary success (Singh et al., 2009), but is also used to define the instrumental value of a state (as in reward shaping (Ng et al., 1999; Booth et al., 2023)), or to demarcate goals (i.e. desired trajectories or states of affairs (Molinaro and Collins, 2023; Davidson et al., 2024)).
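One way to see how these readings come apart is potential-based reward shaping, cited above (Ng et al., 1999): intrinsic desirability is encoded in the base reward, while the shaping term gamma * phi(s') - phi(s) encodes purely instrumental value and provably leaves optimal policies unchanged. The toy states, rewards, and potential below are our own illustrative assumptions:

```python
GAMMA = 0.99  # discount factor

def intrinsic_reward(state):
    # Reward as intrinsic desirability: only the goal state matters in itself.
    return 1.0 if state == "goal" else 0.0

def potential(state):
    # Heuristic progress estimate; any function of state is admissible.
    return {"start": 0.0, "mid": 0.5, "goal": 1.0}[state]

def shaped_reward(state, next_state):
    # Reward as instrumental signal: the shaping term rewards progress
    # without changing which policies are optimal (Ng et al., 1999).
    return (intrinsic_reward(next_state)
            + GAMMA * potential(next_state) - potential(state))

print(shaped_reward("start", "mid"))  # 0.495: progress, no intrinsic value
print(shaped_reward("mid", "goal"))   # 1.49: intrinsic value plus shaping
```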
Preference is a central concept in both the theory and practice of AI alignment. Yet as we have seen, its multiple scopes and meanings are often poorly understood. In this paper, we have sought not only to better contextualize the nature of preferences, but also to challenge their centrality in approaches to AI alignment. In doing so, we hope to have established the goals of AI alignment on firmer normative ground. Crucially, we do not do so by rejecting all preference-based frameworks in alignment, but by reinterpreting what preferences do for us: Since they are constructed from our values, norms, and reasons, they are informative of those underlying structures. As such, preferences can serve as proxies for our values, but not as targets of alignment in and of themselves.
What would AI alignment look like if it took these challenges seriously? It would move away from naive rational choice models of human decision-making, towards richer models that include how we evaluate, commensurate, and act upon our values in boundedly rational ways. It would no longer take for granted expected utility theory, and instead explore systems for reasoning about the normativity of our preferences and values. It would learn to distinguish goodness-of-a-kind preferences from all-things-considered preferences, and identify which of those are operative in any particular decision. It would let go of preference matching as a crisp formalization of alignment, and instead lean into the normative complexity of scoping and defining AI’s social roles. And it would move beyond alignment with aggregate preferences, towards a more pluralistic and contractualist understanding of what it means to live together with AI. If successful, then perhaps the world we can look forward to is not just one we will prefer, but one that we will truly have reason to value.
In Section 3 we turn to expected utility theory (EUT) as a normative standard of rationality. Even while recognizing that humans often do not comply with this standard, alignment researchers have traditionally assumed that sufficiently advanced AI systems will do so, and hence that solutions to AI alignment must be compatible with EUT. In parallel with recent critiques of this view (Thornley, 2023, 2024; Bales, 2023; Petersen, 2023), we argue that EUT is both unnecessary and insufficient for rational agency, and hence limited as both a design strategy and analytical lens. Instead of adhering to utility theory, we can design tool-like AI systems with locally coherent preferences that are not representable as a utility function.
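As a sketch of what such a system might look like, consider a decision rule with incomplete preferences: options are ranked only when one weakly dominates another on every value dimension, and remaining trade-offs are deferred to the human principal. The dimensions, scores, and rule are illustrative assumptions, not a proposal drawn from the cited works:

```python
# Candidate outputs scored on plural, possibly incommensurable dimensions.
options = {
    "draft_A": {"helpfulness": 0.9, "safety": 0.4},
    "draft_B": {"helpfulness": 0.5, "safety": 0.9},
    "draft_C": {"helpfulness": 0.4, "safety": 0.3},  # dominated by both
}

def dominates(a, b):
    """a weakly beats b on every dimension and strictly on at least one."""
    return all(a[d] >= b[d] for d in a) and any(a[d] > b[d] for d in a)

def acceptable(options):
    """Locally coherent choice set: the undominated options. No single
    utility function is needed (or able) to rank the survivors."""
    return [x for x, vx in options.items()
            if not any(dominates(vy, vx) for y, vy in options.items() if y != x)]

print(acceptable(options))  # ['draft_A', 'draft_B']: draft_C is ruled out,
# while the A-vs-B trade-off is left to the human rather than the system.
```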