Measuring Human Preferences in RLHF is a Social Science Problem

Paper · arXiv 2604.03238 · Published January 31, 2026
Alignment · Evaluations · Reinforcement Learning · Social Theory · Society

RLHF assumes that annotation responses reflect genuine human preferences. We argue that this assumption warrants systematic examination, and that behavioral science offers frameworks that clarify when it holds and when it breaks down. Behavioral scientists have documented for sixty years that people routinely produce responses without holding genuine opinions, construct preferences on the spot based on contextual cues, and interpret identical questions differently. These phenomena are pervasive for precisely the value-laden judgments that matter most for alignment, yet this literature has not yet been systematically integrated into ML practice. We argue that the ML community must treat measurement validity as logically prior to preference aggregation. Specifically, we contend that measuring human preferences in RLHF is a social science problem. We present a taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each. This framework has two important implications. First, it raises the question of whether current RLHF practice systematically models noise as signal and elicitation artifacts as human values. Second, it provides a path forward by suggesting diagnostic tools that can distinguish valid preferences from artifacts before they enter the training pipeline.

The field has invested enormous effort in the final links of the RLHF pipeline (better reward modeling, better aggregation, better fine-tuning algorithms), while a logically prior question has received less systematic attention: Do annotation responses reflect genuine preferences at all?

In this paper, we argue that they often may not, and that examining this question has important implications for how the field approaches human feedback. In fact, measuring human preferences in RLHF is fundamentally a social science problem, one that requires importing frameworks the ML community has largely ignored. Behavioral scientists have studied the validity of elicited preferences for over sixty years and have consistently found that humans routinely produce answers without holding genuine opinions, a phenomenon called non-attitudes (Converse, 1964; Krosnick, 1991). Preferences are often constructed on the spot, influenced by framing and context rather than retrieved from stable mental representations (Slovic, 1995; Payne et al., 1993). The same question can measure different constructs for different people (Vandenberg & Lance, 2000). These are pervasive features of human responses to complex, value-laden questions, precisely the questions that matter most for alignment.

Current RLHF practice has not yet systematically accounted for these phenomena. Reward models are trained to predict the majority label, high-disagreement items are filtered or downweighted, and the resulting scalar reward discards information about whether judgments were contested.
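As a concrete illustration of this aggregation step, the sketch below contrasts collapsing annotator votes into a single majority label with retaining the full label distribution; the annotation records and field names are hypothetical rather than drawn from any particular pipeline.

```python
from collections import Counter

# Hypothetical annotation records: several pairwise labels ("A" or "B")
# per comparison item, from different annotators. Names are illustrative.
annotations = {
    "item_01": ["A", "A", "A", "B"],        # mostly agreed
    "item_02": ["B", "A", "B", "A", "B"],   # contested 3-2 split
}

def majority_label(votes):
    """Common practice: collapse all votes into one hard label."""
    return Counter(votes).most_common(1)[0][0]

def label_distribution(votes):
    """Alternative: keep the empirical distribution, so downstream code
    can still see how contested each judgment was."""
    total = len(votes)
    return {label: count / total for label, count in Counter(votes).items()}

for item, votes in annotations.items():
    print(item, majority_label(votes), label_distribution(votes))
# item_01 A {'A': 0.75, 'B': 0.25}
# item_02 B {'B': 0.6, 'A': 0.4}
```

Once only the hard label reaches the reward model, a 3-2 split and a unanimous judgment become indistinguishable training targets.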

Before asking how to aggregate diverse preferences, the field must ask whether the responses being aggregated are preferences at all. Before personalizing reward models to individual annotators, the field must ask whether those annotators have stable preferences to personalize. Before filtering high-disagreement items as noise, the field must ask whether disagreement signals contested values or the absence of values altogether.
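To make that last distinction concrete, the following sketch (with invented repeated-measurement data and an arbitrary stability threshold; the paper's actual diagnostics may differ) separates items where annotators disagree with each other but are individually stable from items where individual responses are themselves unstable.

```python
import math
from collections import Counter

# Hypothetical repeated-measurement data: each annotator labels the same
# comparison twice, in separate sessions. Structure and names are illustrative.
# {item: {annotator: (first_pass_label, second_pass_label)}}
repeats = {
    "contested_item": {
        "ann1": ("A", "A"), "ann2": ("B", "B"),
        "ann3": ("A", "A"), "ann4": ("B", "B"),
    },
    "unstable_item": {
        "ann1": ("A", "B"), "ann2": ("B", "A"),
        "ann3": ("A", "B"), "ann4": ("B", "B"),
    },
}

def disagreement(labels):
    """Entropy (bits) of the label distribution across annotators."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def stability(pairs):
    """Fraction of annotators giving the same label on both passes."""
    return sum(first == second for first, second in pairs) / len(pairs)

for item, by_annotator in repeats.items():
    first_pass = [first for first, _ in by_annotator.values()]
    dis = disagreement(first_pass)
    stab = stability(by_annotator.values())
    # High disagreement + high stability: plausibly contested values.
    # High disagreement + low stability: responses resemble non-attitudes.
    verdict = "contested values" if stab >= 0.75 else "possible non-attitudes"
    print(f"{item}: disagreement={dis:.2f} bits, stability={stab:.2f} -> {verdict}")
```

Both items show maximal cross-annotator disagreement on a single pass, yet only the second looks like noise once repeated measurements are available.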

We make three core claims. First, not all annotation responses are preferences. Responses can reflect non-attitudes, constructed preferences, or measurement artifacts, each requiring fundamentally different treatment. Second, validity can be diagnosed through consistency: genuine preferences manifest stably across equivalent measurement conditions, while artifacts do not. Third, the field’s current priorities may be inverted: optimizing downstream algorithms while neglecting the validity of the inputs they depend on. We are not arguing that annotation data is useless or that RLHF should be abandoned. Rather, assessing measurement validity should precede, or at minimum accompany, efforts to improve what is done with that data.
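As one way to operationalize the second claim, the sketch below checks whether an annotator's choice between two responses survives a measurement-equivalent change, here swapping the left/right presentation order; the trial log, field names, and threshold are hypothetical illustrations rather than the paper's protocol.

```python
# Hypothetical elicitation log: the same pair of responses is shown either in
# the original order (resp_x on the left) or swapped (resp_y on the left), and
# we record which screen position the annotator picked. Names are illustrative.
trials = [
    {"order": "original", "picked_position": "left"},   # left here is resp_x
    {"order": "swapped",  "picked_position": "right"},  # right here is resp_x
    {"order": "original", "picked_position": "left"},
    {"order": "swapped",  "picked_position": "left"},   # left here is resp_y
]

def picked_response(trial):
    """Map the chosen screen position back to the underlying response."""
    if trial["order"] == "original":
        return "resp_x" if trial["picked_position"] == "left" else "resp_y"
    return "resp_x" if trial["picked_position"] == "right" else "resp_y"

choices = [picked_response(t) for t in trials]
consistency = max(choices.count("resp_x"), choices.count("resp_y")) / len(choices)

# A genuine preference keeps selecting the same response regardless of order;
# a position artifact keeps selecting the same screen position instead.
if consistency >= 0.75:  # arbitrary illustrative threshold
    favored = max(set(choices), key=choices.count)
    print(f"stable preference for {favored} (consistency={consistency:.2f})")
else:
    print(f"order-sensitive responses (consistency={consistency:.2f}); "
          "treat as a candidate measurement artifact")
```

An annotator who keeps picking the left-hand response regardless of which answer occupies that position would fail this check, which is exactly the kind of response the framework would treat as a measurement artifact rather than a preference.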