Should AI alignment target preferences or social role norms?
Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?
The "Beyond Preferences" paper identifies four theses that constitute the preferentist approach dominating AI alignment — and challenges all of them:
Rational Choice Theory as descriptive framework — human behavior is well-modeled as preference maximization. But preferences fail to capture the thick semantic content of values. Satisfying a preference for copyright-violating content may maximize immediate welfare while conflicting with the person's all-things-considered moral judgment.
Expected Utility Theory as normative standard — rational agency requires utility maximization. But EUT is neither necessary nor sufficient for rational agency. We can design AI systems with locally coherent preferences that are not representable as a utility function (a brief sketch of why follows the four theses).
Single-Principal Alignment as preference matching — align AI with one human's preferences. But preferences are dynamic, contextual, and often incommensurable even within a single person. Reward functions cannot serve as alignment targets for broadly-scoped systems.
Multi-Principal Alignment as preference aggregation — aggregate everyone's preferences. But uniform aggregation constitutes epistemic injustice when most annotators are insensitive to identity discrimination. If RLHF labelers don't recognize transphobic or antisemitic content, the trained model won't either.
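To make the second challenge concrete, here is a minimal formal sketch (my illustration, assuming the standard representation condition, not an argument quoted from the paper) of how preferences can be locally coherent yet escape utility representation: incompleteness, not intransitivity, is enough.

```latex
% Minimal sketch: why locally coherent preferences can lack a utility
% representation. A representation u requires, for all options A, B:
\[
  A \succsim B \iff u(A) \ge u(B).
\]
% Suppose the agent treats A and B as genuinely incomparable:
\[
  \neg(A \succsim B) \quad\text{and}\quad \neg(B \succsim A).
\]
% The reals are totally ordered, so any u : X \to \mathbb{R} satisfies
% u(A) \ge u(B) or u(B) \ge u(A), forcing a comparison the agent does not
% make. Completeness fails but transitivity need not, so the relation is
% locally coherent yet admits no faithful expected-utility representation.
```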
The alternative: AI should align with normative standards appropriate to its social roles (assistant, advisor, companion), negotiated by all relevant stakeholders. This is a contractualist framing — what people would reasonably agree to — rather than a utilitarian one. Preferences serve as proxies for values, informative of underlying structures, but not alignment targets in themselves.
This reframes the alignment tax identified in Does preference optimization harm conversational understanding?. The tax exists because preference optimization targets a proxy that is systematically misaligned with the social role the system is meant to fill. A conversational assistant's normative standard should include grounding acts; RLHF's preference signal systematically selects against them.
The political infeasibility argument is particularly sharp: building AI that optimizes humanity's aggregate preferences would centralize immense power. Even pro-social developers face market incentives that prevent impartially benevolent optimization. The contractualist alternative distributes decision-making rather than centralizing it.
The "Personalisation within Bounds" paper extends this philosophical critique into practical governance. It identifies a "tyranny of the crowdworker" — RLHF alignment reflects whoever happened to label the data, with little documentation of who these labelers are or what perspectives they represent. The paper proposes a three-tiered policy framework: (1) supra-national bounds (safety, universal norms), (2) organizational bounds (institutional values, domain standards), and (3) individual personalization (user preferences within the bounded space). This provides a concrete implementation of the contractualist alternative — personalization is not unconstrained preference-matching but operates within negotiated societal and organizational limits.
Extension — the measurement pincer: The Beyond Preferences critique operates at the normative level: preferences are the wrong kind of target for alignment. A complementary critique operates at the measurement level: even within the preferentist framework, the preferences being measured are often not preferences at all. Are RLHF annotations actually measuring genuine human preferences? argues from behavioral science that annotation responses frequently reflect non-attitudes, constructed preferences, and measurement artifacts rather than stable preferences. Taken together, the two critiques form a pincer: preferences are both wrong-in-kind (normative argument) and wrong-in-measurement (measurement argument). A reader who resists the normative argument because they find preferentism theoretically coherent still faces the measurement argument: the inputs feeding the preferentist pipeline are invalid, so no aggregation rule can recover what was never there. This strengthens the contractualist case by denying preferentism even its empirical foothold.
Source: Alignment
Related concepts in this collection
- Does preference optimization harm conversational understanding?
Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
RLHF targets preferences when it should target normative standards of conversational competence
- Does incremental AI replacement erode human influence over society?
Explores whether gradual AI adoption—without dramatic breakthroughs—can silently degrade human agency by removing the labor that kept institutions implicitly aligned with human needs.
the political dimension: preference aggregation centralizes power
- Can we measure how deeply models represent political ideology?
This research explores whether LLMs vary not just in political stance but in the internal richness of their political representation. Understanding this distinction could reveal how deeply models have internalized ideological concepts versus merely parroting positions.
emergent values in LLMs challenge the assumption that preferences can be externally imposed
- Can LLMs hold contradictory ethical beliefs and behaviors?
Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.
goodness-of-a-kind vs all-things-considered mirrors prescriptive/descriptive misalignment
- How do personalization granularity levels trade precision against scalability?
LLM personalization operates at user, persona, and global levels, each with different tradeoffs. Understanding these tradeoffs helps determine when to invest in individual user data versus broader patterns.
the granularity taxonomy maps where normative standards critique applies: global-preference personalization faces the aggregation critique (epistemic injustice from flattening diversity); user-level personalization risks unconstrained preference-matching without role-appropriate normative bounds
- What anchors a stable identity beneath an LLM's persona?
Human personas are grounded in biological needs and embodied experience, creating a stable self beneath social performance. Do LLMs have any comparable anchor, or is their identity purely situational?
social-role alignment is particularly apt for LLMs because role play is all they are; aligning to social roles targets the only kind of identity LLMs possess rather than projecting preferences onto an entity with no stable self
- Does machine agency exist on a spectrum rather than binary?
Rather than viewing AI as either autonomous or controlled, does machine agency actually operate across five distinct levels from passive to cooperative? Understanding this spectrum matters because it shapes how users calibrate trust and control expectations.
the normative standards appropriate to each social role map onto different agency levels; a passive tool requires different alignment standards than a cooperative agent
- Can AI systems preserve moral value conflicts instead of averaging them?
Current AI systems wash out value tensions through majority aggregation. Can we instead model how values like honesty and friendship genuinely conflict in moral reasoning?
value pluralism provides the mechanism for implementing normative standards: rather than aggregating preferences or imposing universal rules, the system models the relevant values for each social role and their contextual interactions
- Are RLHF annotations actually measuring genuine human preferences?
RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
the measurement pincer: preferences are wrong-in-measurement as well as wrong-in-kind
- Do all annotation responses measure the same underlying thing?
Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.
taxonomy operationalizing the measurement critique; RLHF currently collapses three distinct signal types into one
Original note title
AI should align with normative standards appropriate to social roles not with individual or aggregate preferences