Language Understanding and Pragmatics · Psychology and Social Cognition

Should AI alignment target preferences or social role norms?

Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?

Note · 2026-02-23 · sourced from Alignment

The "Beyond Preferences" paper identifies four theses that constitute the preferentist approach dominating AI alignment — and challenges all of them:

  1. Rational Choice Theory as descriptive framework — human behavior is well-modeled as preference maximization. But preferences fail to capture the thick semantic content of values. A preference for copyright violation may maximize aggregate immediate welfare while violating all-things-considered moral judgment.

  2. Expected Utility Theory as normative standard — rational agency requires utility maximization. But EUT is neither necessary nor sufficient for rational agency. We can design AI systems with locally coherent preferences that are not representable as a utility function (see the thesis 2 sketch below this list).

  3. Single-Principal Alignment as preference matching — align AI with one human's preferences. But preferences are dynamic, contextual, and often incommensurable even within a single person. Reward functions cannot serve as alignment targets for broadly-scoped systems.

  4. Multi-Principal Alignment as preference aggregation — aggregate everyone's preferences. But uniform aggregation constitutes epistemic injustice when most annotators are insensitive to identity discrimination. If RLHF labelers don't recognize transphobic or antisemitic content, the trained model won't either (see the thesis 4 sketch below this list).
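
To make thesis 2 concrete, here is a minimal sketch of why a preference cycle blocks any utility representation. The response labels, the per-context judgments, and the brute-force check are invented for illustration; they are not from the paper.

```python
# Thesis 2 sketch (illustrative, hypothetical data): locally coherent
# pairwise preferences that no utility function can represent.
# A utility representation requires acyclicity: if a > b and b > c,
# then u(a) > u(c), so a strict preference cycle rules out any such u.
from itertools import permutations

# Each pairwise judgment is made in a different context and is
# individually reasonable; jointly they form a cycle.
prefers = {
    ("concise", "thorough"),   # context: quick factual query
    ("thorough", "cautious"),  # context: high-stakes medical question
    ("cautious", "concise"),   # context: ambiguous request
}

def representable_by_utility(prefs, items):
    """True iff some utility assignment orders the items consistently
    with every stated strict preference (brute force over orderings)."""
    for ranking in permutations(items):
        u = {item: -rank for rank, item in enumerate(ranking)}
        if all(u[a] > u[b] for a, b in prefs):
            return True
    return False

print(representable_by_utility(prefers, ["concise", "thorough", "cautious"]))
# False: the cycle admits no utility function, yet each judgment is coherent
```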
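
And for thesis 4, a toy illustration of how uniform aggregation erases a minority harm signal. The annotator counts and the veto rule are hypothetical; the veto is a crude stand-in for negotiating standards with affected stakeholders, not a mechanism from the paper.

```python
# Thesis 4 sketch (illustrative, hypothetical data): uniform aggregation
# vs. a norm-sensitive rule applied to the same annotations.
from statistics import mean

# Ten annotators label a response: 1 = acceptable, 0 = harmful.
# Only two recognize a subtle instance of identity-based discrimination.
labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Uniform aggregation, roughly what naive reward modeling does:
print(round(mean(labels)))  # 1 -> the harm signal is voted away

# Norm-sensitive alternative: treat a credible harm report as a veto
# rather than as one vote among many.
print(0 if 0 in labels else 1)  # 0 -> the harm signal survives
```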

The alternative: AI should align with normative standards appropriate to its social roles (assistant, advisor, companion), negotiated by all relevant stakeholders. This is a contractualist framing — what people would reasonably agree to — rather than a utilitarian one. Preferences remain informative as evidence about underlying values, but they are proxies, not alignment targets in themselves.

This reframes the alignment tax identified in Does preference optimization harm conversational understanding?. The tax exists because preference optimization targets a proxy that is systematically misaligned with the social role the system is meant to fill. A conversational assistant's normative standard should include grounding acts (acknowledgments, clarification requests, repairs of misunderstanding); RLHF's preference signal selects against them.

The political infeasibility argument is particularly sharp: building AI that optimizes humanity's aggregate preferences would centralize immense power. Even pro-social developers face market incentives that prevent impartially benevolent optimization. The contractualist alternative distributes decision-making rather than centralizing it.

The "Personalisation within Bounds" paper extends this philosophical critique into practical governance. It identifies a "tyranny of the crowdworker" — RLHF alignment reflects whoever happened to label the data, with little documentation of who these labelers are or what perspectives they represent. The paper proposes a three-tiered policy framework: (1) supra-national bounds (safety, universal norms), (2) organizational bounds (institutional values, domain standards), and (3) individual personalization (user preferences within the bounded space). This provides a concrete implementation of the contractualist alternative — personalization is not unconstrained preference-matching but operates within negotiated societal and organizational limits.

Extension — the measurement pincer: The Beyond Preferences critique operates at the normative level: preferences are the wrong kind of target for alignment. A complementary critique operates at the measurement level: even within the preferentist framework, the preferences being measured are often not preferences at all. Are RLHF annotations actually measuring genuine human preferences? argues from behavioral science that annotation responses frequently reflect non-attitudes, constructed preferences, and measurement artifacts rather than stable preferences. Taken together, the two critiques form a pincer: preferences are both wrong-in-kind (normative argument) and wrong-in-measurement (measurement argument). A reader who resists the normative argument because they find preferentism theoretically coherent still faces the measurement argument: the inputs feeding the preferentist pipeline are invalid, so no aggregation rule can recover what was never there. This strengthens the contractualist case by denying preferentism even its empirical foothold.


Source: Alignment

Original note title: AI should align with normative standards appropriate to social roles, not with individual or aggregate preferences