Do all annotation responses measure the same underlying thing?
Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.
Six decades of preference-elicitation research in behavioral science yield a taxonomy that RLHF practice collapses into a single signal. The three categories matter because they require different treatment — and treating them uniformly is the upstream mistake that Are RLHF annotations actually measuring genuine human preferences? argues contaminates the entire pipeline.
Genuine preferences manifest stably across equivalent measurement conditions. Ask the same question with different surface wording, different framing, different order, and the response stays the same. This is what the reward model is supposed to be learning. Only this category is safe to aggregate in the way standard RLHF aggregates.
Non-attitudes are responses generated to satisfy the question without any stable underlying opinion. The respondent has never formed a view on the matter, but the measurement protocol demands an answer, so one gets produced. Non-attitudes are especially pervasive for value-laden questions — precisely the questions that matter most for alignment. Non-attitudes look like genuine preferences in a single measurement but fail the consistency test: re-ask the same respondent and you get a different answer because there was never a stable view to retrieve. Current RLHF treats these as noise to filter or minority views to downweight. The behavioral science view is different: non-attitudes contain no signal at all and should be excluded, not averaged with genuine preferences.
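A toy simulation makes the dilution concrete (a sketch under assumed response behavior, not a measurement of real annotators): genuine preferences answer consistently, non-attitudes answer at chance, and pooling the two drags the aggregate toward 50/50.

```python
import random

random.seed(0)

def simulate_vote(is_genuine: bool, true_rate: float = 0.9) -> int:
    """Return 1 if the annotator says they prefer response A over response B.

    Genuine preferences favor A with probability `true_rate`; non-attitudes
    answer at chance, because there is no stable view to retrieve.
    """
    p = true_rate if is_genuine else 0.5
    return 1 if random.random() < p else 0

def aggregate_preference(n_genuine: int, n_nonattitude: int) -> float:
    """Fraction of pooled annotators preferring A when both signal types are averaged."""
    votes = [simulate_vote(True) for _ in range(n_genuine)]
    votes += [simulate_vote(False) for _ in range(n_nonattitude)]
    return sum(votes) / len(votes)

# With no non-attitudes the aggregate tracks the genuine ~90% preference;
# as non-attitudes are pooled in, the aggregate drifts toward 50%.
for n_na in (0, 100, 300):
    print(f"{n_na} non-attitudes pooled with 100 genuine: {aggregate_preference(100, n_na):.2f}")
```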
Constructed preferences are assembled on the spot from contextual cues and framing. The respondent is not uncertain (as in a non-attitude); they are producing a coherent answer that depends on the measurement context. Change the context — different anchoring, different comparison class, different framing — and you get a different coherent answer. This category carries real information, but about the interaction between person and context, not about a stable property of the person. RLHF treats constructed preferences as context-independent preferences and trains reward models on them as if they were. The result: reward models that look good on in-distribution evaluation but fail when the deployment context differs from the annotation context.
Measurement artifacts form a fourth related category: same question measuring different constructs for different respondents. One annotator interprets "helpful" as "completes the task"; another interprets it as "gives correct information even when unasked"; a third interprets it as "avoids making the user feel incompetent." They provide coherent, stable responses — each tracking a real preference of theirs — but they are not tracking the same thing. RLHF aggregates them as if they were.
The diagnostic criterion that separates these is consistency across equivalent measurement conditions. Genuine preferences pass; non-attitudes, constructed preferences, and measurement artifacts each fail in distinctive ways. Non-attitudes fail on re-ask (no stable view). Constructed preferences fail on context perturbation (context-dependent). Measurement artifacts fail on question rephrasing (different construct elicited). These are distinguishable empirically, and the distinction determines what should be done with each.
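One way to make the diagnostic operational (the probe names, threshold, and precedence order below are illustrative assumptions, not an established protocol): run three consistency probes per annotator-item pair and classify by which probe fails.

```python
from dataclasses import dataclass

@dataclass
class ProbeResults:
    """Agreement rates (0-1) across repeated measurements of one annotator-item pair."""
    reask_agreement: float         # same question, asked again later
    perturbation_agreement: float  # same question, framing/order/anchor changed
    rephrase_agreement: float      # same intended construct, different surface wording

def classify_response(probes: ProbeResults, threshold: float = 0.8) -> str:
    """Assign a signal type from the distinctive failure pattern of each category.

    The threshold and the precedence order are illustrative assumptions.
    """
    if probes.reask_agreement < threshold:
        return "non-attitude"            # no stable view to retrieve
    if probes.perturbation_agreement < threshold:
        return "constructed preference"  # stable only within a measurement context
    if probes.rephrase_agreement < threshold:
        return "measurement artifact"    # stable, but tracking a different construct
    return "genuine preference"          # consistent across equivalent conditions

print(classify_response(ProbeResults(0.95, 0.55, 0.90)))  # -> constructed preference
```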
The operational implication is a pre-aggregation filtering step that RLHF currently lacks. Before training the reward model, submit annotation tasks to consistency protocols: re-ask selected items, perturb framings, rephrase questions. Responses that fail consistency tests are not aggregated as preferences; they are either excluded (non-attitudes), contextualized (constructed preferences), or routed to separate annotators (measurement artifacts). This is operationally demanding but conceptually necessary: the alternative is the status quo, in which Why do preference models favor surface features over substance? documents 40% divergences without being able to attribute them to a specific upstream cause.
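A sketch of where such a filter would sit, assuming the classifier above and hypothetical record fields (`probes`, `context`, `annotator_id`); the routing mirrors the three treatments named in the text.

```python
from collections import defaultdict

def filter_before_aggregation(records, classify):
    """Route annotation records by signal type before reward-model training.

    Each record is assumed to carry `probes` (consistency-probe results),
    `context` (the annotation conditions), and `annotator_id`; these field
    names are hypothetical.
    """
    train_set, by_construct = [], defaultdict(list)
    for rec in records:
        kind = classify(rec["probes"])
        if kind == "non-attitude":
            continue                                       # exclude: no signal to aggregate
        elif kind == "constructed preference":
            rec["context_features"] = rec["context"]       # contextualize: keep the conditions that produced it
            train_set.append(rec)
        elif kind == "measurement artifact":
            by_construct[rec["annotator_id"]].append(rec)  # route separately: don't pool across constructs
        else:
            train_set.append(rec)                          # genuine preference: safe to aggregate
    return train_set, by_construct
```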
The taxonomy also suggests why Can models learn to ignore irrelevant prompt changes? works as an output-side intervention. If the upstream measurement problem is consistency failure across equivalent conditions, then training models to be invariant to equivalent-condition perturbations is a downstream patch for the same underlying phenomenon: the system's current sensitivity to irrelevant cue variation.
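For that output-side analog, a minimal sketch of a consistency regularizer, assuming a Hugging-Face-style model whose forward pass returns `.logits`; the perturbation source and loss weight are left as placeholders.

```python
import torch.nn.functional as F

def consistency_loss(model, prompt_ids, perturbed_ids):
    """KL divergence between the model's next-token distributions on a prompt
    and on an equivalent-condition perturbation of it (rewording, reordering,
    an added irrelevant cue). Minimizing it pushes the two outputs together.
    """
    logits_a = model(prompt_ids).logits[:, -1, :]      # original framing
    logits_b = model(perturbed_ids).logits[:, -1, :]   # perturbed framing
    return F.kl_div(
        F.log_softmax(logits_b, dim=-1),
        F.softmax(logits_a, dim=-1),
        reduction="batchmean",
    )

# total_loss = task_loss + lambda_consistency * consistency_loss(model, ids, perturbed_ids)
```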
Source: Alignment
Related concepts in this collection
- Are RLHF annotations actually measuring genuine human preferences?
  RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
  Relation: the parent argument this taxonomy operationalizes.
- Why do preference models favor surface features over substance?
  Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness—features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
  Relation: the 40% divergence as downstream symptom; this taxonomy points upstream.
- Why do reasoning models fail at predicting disagreement?
  RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
  Relation: disagreement that should be preserved vs. disagreement that signals a non-attitude — current RLHF conflates them.
- Can models learn to ignore irrelevant prompt changes?
  Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
  Relation: consistency-as-diagnostic maps to consistency-as-training-objective.
- Why do LLM persona prompts produce inconsistent outputs across runs?
  Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
  Relation: unstable-across-runs is the constructed-preference signature in simulated annotators.
- Why do LLM judges fail at predicting sparse user preferences?
  When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
  Relation: verbal uncertainty estimation as an abstention analog for identifying non-attitudes.
- Should AI alignment target preferences or social role norms?
  Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?
  Relation: the normative critique; this note is the measurement refinement that specifies what the inputs actually contain.
- Can text summaries condition reward models better than embeddings?
  Exploring whether learning interpretable text-based summaries of user preferences outperforms embedding vectors for training personalized reward models in language model alignment.
  Relation: text summaries preserve the context that constructed preferences depend on, where scalar rewards lose it.
Original note title: annotation responses decompose into three distinct signal types — genuine preferences, non-attitudes, and constructed preferences — each requiring fundamentally different handling