Language Understanding and Pragmatics · Psychology and Social Cognition

Are RLHF annotations actually measuring genuine human preferences?

RLHF treats annotation responses as stable preferences, but behavioral science shows humans often construct answers on the spot without holding real opinions. Does this measurement gap undermine the entire approach?

Note · 2026-04-07 · sourced from Alignment

The RLHF research program has invested enormous effort in the final links of its chain: better reward modeling architectures, better preference aggregation rules, better fine-tuning algorithms. A logically prior question has received less systematic attention: do the annotation responses being modeled reflect genuine preferences at all? This paper argues — drawing on sixty years of behavioral science literature that the ML community has largely ignored — that they often may not, and that this measurement validity question must be answered before any aggregation or fine-tuning decision makes sense.

The behavioral science findings are well-established. Humans routinely produce answers to survey questions without holding genuine opinions, a phenomenon called non-attitudes (Converse 1964; Krosnick 1991). Preferences are often constructed on the spot, influenced by framing and context rather than retrieved from stable mental representations (Slovic 1995; Payne et al. 1993). The same question can measure different constructs for different people (Vandenberg & Lance 2000). These are not marginal effects. They are pervasive for precisely the value-laden judgments that matter most for alignment: "should the AI refuse this request," "which response is more helpful," "is this harmful." Current RLHF practice trains reward models to predict the majority label, filters or downweights high-disagreement items, and produces a scalar reward that discards information about whether judgments were contested. The result: RLHF may be "systematically modeling noise as signal and elicitation artifacts as human values."
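The aggregation step described above can be made concrete with a toy sketch. This is an illustration of standard majority-label collapse, not code from any particular RLHF pipeline; the function name and data are hypothetical:

```python
from collections import Counter

def majority_label(annotations):
    """Collapse per-item annotations to a single training label,
    as majority-vote preference-data pipelines do. Returns the
    winning label and the vote share it received."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)

# Hypothetical annotation sets for two comparison items:
contested = ["A", "B", "A", "B", "A"]   # 3-2 split: contested values?
unanimous = ["A", "A", "A", "A", "A"]   # genuine consensus

label_c, share_c = majority_label(contested)   # ("A", 0.6)
label_u, share_u = majority_label(unanimous)   # ("A", 1.0)

# Both items contribute the identical label "A" to the reward model;
# the vote share (0.6 vs 1.0), the only trace of contestation,
# is typically dropped before training.
```

The point of the sketch is the last comment: once both items are reduced to the label "A", the reward model has no way to distinguish contested judgments from consensus ones.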

The logical ordering matters. Before asking how to aggregate diverse preferences, the field must ask whether the responses being aggregated are preferences at all. Before personalizing reward models to individual annotators, the field must ask whether those annotators have stable preferences to personalize. Before filtering high-disagreement items as noise, the field must ask whether disagreement signals contested values, absent values, or constructed preferences that would give different answers to the same question twenty minutes later. Each of these downstream questions presumes a solved version of the measurement validity question — and that presumption is not warranted by current practice.

This provides a second line of attack on preferentism, one that reaches even readers who accept preferentism in principle. Should AI alignment target preferences or social role norms? argues preferences are the wrong target on normative grounds. Measuring Human Preferences argues that even within the preferentist framework, the measurement inputs are invalid, so aggregation cannot save the approach. Together they form a pincer: preferences are both wrong-in-kind and wrong-in-measurement.

The paper's constructive contribution is a research agenda: treat measurement validity as logically prior to aggregation. Diagnose non-attitudes, constructed preferences, and measurement artifacts using the consistency criterion (do responses stabilize across equivalent conditions?). Route each type to appropriate treatment rather than collapsing them into a single signal. The alternative is an RLHF pipeline that fights downstream artifacts it inherits from upstream measurement failures — which is where the field finds itself when Why do preference models favor surface features over substance? and Why do reasoning models fail at predicting disagreement? document 40%+ divergences and systematic disagreement-suppression without being able to point to the upstream cause.
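The consistency criterion above can be sketched as a toy test-retest diagnostic. Everything here is illustrative, not from the paper: the function name, the labels, and the 0.8 stability threshold are assumptions chosen for the example:

```python
def classify_response_pattern(responses, stability_threshold=0.8):
    """Toy diagnostic for the consistency criterion: given one
    annotator's repeated judgments on equivalently framed versions
    of the same item, estimate whether the responses look like a
    stable preference or a constructed / non-attitude response.
    The 0.8 threshold is an illustrative assumption."""
    if not responses:
        return "no_data"
    # Share of responses matching the most common (modal) answer.
    modal_share = max(responses.count(r) for r in set(responses)) / len(responses)
    if modal_share >= stability_threshold:
        return "stable_preference"
    return "possible_non_attitude"

# One annotator, the same item asked four times under equivalent framings:
classify_response_pattern(["refuse", "refuse", "refuse", "refuse"])
# -> "stable_preference"
classify_response_pattern(["refuse", "allow", "refuse", "allow"])
# -> "possible_non_attitude"
```

A real diagnostic would need to control for legitimate context sensitivity and for question-wording effects; the sketch only shows the shape of the routing step: measure stability first, then decide how to treat the responses.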

The practical implication is uncomfortable. If measurement validity is suspect, then a significant portion of the alignment investment of the last several years has been optimizing the wrong objective — not because preferences are the wrong target (the Beyond Preferences critique) but because the "preferences" in the training data are not preferences.


preference measurement validity is logically prior to preference aggregation — RLHF may be systematically modeling elicitation artifacts as human values