Are RLHF annotations actually measuring genuine human preferences?
RLHF treats annotation responses as expressions of stable preferences, but behavioral science shows humans often construct answers on the spot without holding real opinions. Does this measurement gap undermine the entire approach?
The RLHF research program has invested enormous effort in the final links of its chain: better reward modeling architectures, better preference aggregation rules, better fine-tuning algorithms. A logically prior question has received less systematic attention: do the annotation responses being modeled reflect genuine preferences at all? This paper argues — drawing on sixty years of behavioral science literature that the ML community has largely ignored — that they often may not, and that this measurement validity question must be answered before any aggregation or fine-tuning decision makes sense.
The behavioral science findings are well-established. Humans routinely produce answers to survey questions without holding genuine opinions, a phenomenon called non-attitudes (Converse 1964; Krosnick 1991). Preferences are often constructed on the spot, influenced by framing and context rather than retrieved from stable mental representations (Slovic 1995; Payne et al. 1993). The same question can measure different constructs for different people (Vandenberg & Lance 2000). These are not marginal effects. They are pervasive for precisely the value-laden judgments that matter most for alignment: "should the AI refuse this request," "which response is more helpful," "is this harmful." Current RLHF practice trains reward models to predict the majority label, filters or downweights high-disagreement items, and produces a scalar reward that discards information about whether judgments were contested. The result: RLHF may be "systematically modeling noise as signal and elicitation artifacts as human values."
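To make the collapse concrete, here is a minimal, purely illustrative sketch (not code from the paper) of the preprocessing step described above: per-annotator judgments are reduced to a majority label, high-disagreement items are dropped, and the fact that a judgment was contested never reaches the reward model. Function names, field layout, and the threshold are hypothetical.

```python
from collections import Counter

def to_training_label(annotations: list[str], agreement_threshold: float = 0.75):
    """Reduce per-annotator choices ('A' or 'B') to a single majority label.

    Returns None for high-disagreement items, mimicking the common practice of
    filtering them out as noise. Whether the judgment was contested, constructed
    on the spot, or a non-attitude is discarded at this step.
    """
    counts = Counter(annotations)
    winner, n_winner = counts.most_common(1)[0]
    agreement = n_winner / len(annotations)
    if agreement < agreement_threshold:
        return None   # high-disagreement item dropped before reward modeling
    return winner     # contested and unanimous items now look identical

# A contested 3-2 split and a unanimous 5-0 split yield the same training label:
print(to_training_label(["A", "A", "A", "B", "B"], agreement_threshold=0.5))  # -> 'A'
print(to_training_label(["A", "A", "A", "A", "A"]))                           # -> 'A'
```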
The logical ordering matters. Before asking how to aggregate diverse preferences, the field must ask whether the responses being aggregated are preferences at all. Before personalizing reward models to individual annotators, the field must ask whether those annotators have stable preferences to personalize. Before filtering high-disagreement items as noise, the field must ask whether disagreement signals contested values, absent values, or constructed preferences that would give different answers to the same question twenty minutes later. Each of these downstream questions presumes a solved version of the measurement validity question — and that presumption is not warranted by current practice.
This provides a second line of critique against preferentism, one that reaches even readers who accept preferentism in principle. Should AI alignment target preferences or social role norms? argues that preferences are the wrong target on normative grounds. Measuring Human Preferences argues that even within the preferentist framework, the measurement inputs are invalid, so aggregation cannot save the approach. Together they form a pincer: preferences are both wrong-in-kind and wrong-in-measurement.
The paper's constructive contribution is a research agenda: treat measurement validity as logically prior to aggregation. Diagnose non-attitudes, constructed preferences, and measurement artifacts using the consistency criterion (do responses stabilize across equivalent conditions?). Route each type to appropriate treatment rather than collapsing them into a single signal. The alternative is an RLHF pipeline that fights downstream artifacts it inherits from upstream measurement failures — which is where the field finds itself when Why do preference models favor surface features over substance? and Why do reasoning models fail at predicting disagreement? document 40%+ divergences and systematic disagreement-suppression without being able to point to the upstream cause.
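The consistency criterion lends itself to a simple diagnostic sketch. The code below is a hypothetical illustration rather than the paper's procedure: re-present each comparison to the same annotators under nominally equivalent conditions (re-phrasings, or a delayed re-ask) and classify the item by whether responses stabilize. The routing categories and names are assumptions made for illustration.

```python
def diagnose_item(responses_per_annotator: dict[str, list[str]]) -> str:
    """Classify an item by whether responses stabilize across equivalent conditions.

    responses_per_annotator maps an annotator id to that annotator's choices
    across nominally equivalent presentations of the same comparison
    (re-phrasings, or the same item re-asked after a delay).
    """
    stable = sum(1 for choices in responses_per_annotator.values()
                 if len(set(choices)) == 1)
    unstable = len(responses_per_annotator) - stable

    if unstable == 0:
        # Annotators answer the same way every time: looks like genuine
        # (possibly contested) preferences worth modeling.
        return "stable_preferences"
    if stable == 0:
        # No annotator is consistent with themselves: the signature of
        # non-attitudes or preferences constructed on the spot.
        return "non_attitudes_or_constructed"
    # Some annotators are stable, others are not: route for closer inspection
    # rather than collapsing everything into a single label.
    return "mixed"

item = {
    "annotator_1": ["A", "A"],   # same choice under both framings
    "annotator_2": ["B", "A"],   # flips between framings
}
print(diagnose_item(item))       # -> "mixed"
```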
The practical implication is uncomfortable. If measurement validity is suspect, then a significant portion of the alignment investment of the last several years has been optimizing the wrong objective — not because preferences are the wrong target (the Beyond Preferences critique) but because the "preferences" in the training data are not preferences.
Source: Alignment
Related concepts in this collection
- Do all annotation responses measure the same underlying thing?
  Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent: stable preferences versus non-attitudes versus context-dependent constructions.
  Relation: the companion insight detailing the diagnostic taxonomy.
- Should AI alignment target preferences or social role norms?
  Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?
  Relation: the normative pincer (preferences are wrong-in-kind); this note is the measurement pincer (preferences are wrong-in-measurement).
- Why do preference models favor surface features over substance?
  Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness, features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
  Relation: the 40% divergence is a downstream symptom of measurement validity failure.
- Why do reasoning models fail at predicting disagreement?
  RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
  Relation: suppression of legitimate disagreement variance is the measurement failure in action.
- Can text summaries condition reward models better than embeddings?
  Explores whether learning interpretable text-based summaries of user preferences outperforms embedding vectors for training personalized reward models in language model alignment.
  Relation: text-based summaries may recover context lost when scalar rewards discard disagreement signals.
- Why do LLM persona prompts produce inconsistent outputs across runs?
  Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
  Relation: instability across re-asks is exactly the constructed-preference signature.
- Can models learn to ignore irrelevant prompt changes?
  Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
  Relation: consistency as a diagnostic for validity maps directly to consistency as a training objective.
- Can models learn to abstain when uncertain about predictions?
  Explores whether language models can be trained to recognize when they lack sufficient information to forecast conversation outcomes, rather than forcing uncertain predictions into confident-sounding responses.
  Relation: abstention on uncertain outputs is the modeling-side analog of filtering non-attitudes at the input side.
- Why do LLM judges fail at predicting sparse user preferences?
  When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
  Relation: persona sparsity and measurement validity are adjacent; sparse sampling produces artifacts.
- Does transformer attention architecture inherently favor repeated content?
  Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
  Relation: a complementary failure at the architectural layer. Upstream measurement produces artifacts as data; architectural attention amplifies whatever is in context regardless. Together they guarantee sycophancy survives any pipeline cleanup.
Original note title: "preference measurement validity is logically prior to preference aggregation — RLHF may be systematically modeling elicitation artifacts as human values"