Can aggregate reward models satisfy genuinely disagreeing users?
When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?
There is a clean argument for why aggregate reward models cannot serve disagreement-heavy tasks. Consider a subjective question where 51% of the target audience prefers answer A and 49% prefers answer B. With a single reward model trained on aggregated preferences, the deployment has exactly two options. Pick A as the preferred answer: 49% of users are unhappy 100% of the time. Sample A and B in proportion to their preference rates: 100% of users are unhappy approximately half the time. Both options are unsatisfactory.
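The arithmetic behind the two options can be checked with a minimal sketch. The function name `unhappy_rates` and its parameters are illustrative assumptions, not part of any RLHF library; the only inputs are the 51/49 split from the example and the probability that the deployed system returns A.

```python
def unhappy_rates(p_a, serve_a):
    """Fraction of interactions in which each group receives the answer it dislikes.

    p_a:     share of users who prefer A (the remainder prefer B).
    serve_a: probability that the deployed system returns A.
    Hypothetical helper for illustration only.
    """
    return {
        "prefers_A": 1.0 - serve_a,  # A-preferrers are unhappy whenever B is served
        "prefers_B": serve_a,        # B-preferrers are unhappy whenever A is served
        # Population-wide rate: weight each group's rate by its share.
        "population": p_a * (1.0 - serve_a) + (1.0 - p_a) * serve_a,
    }

# Option 1: always return the majority answer A.
majority = unhappy_rates(p_a=0.51, serve_a=1.0)

# Option 2: sample A and B in proportion to their preference rates.
proportional = unhappy_rates(p_a=0.51, serve_a=0.51)

print(majority)
print(proportional)
```

Under these assumptions, always serving A leaves the 49% minority unhappy in every interaction, while proportional sampling leaves each group unhappy roughly half the time (0.49 for A-preferrers, 0.51 for B-preferrers). Neither setting of `serve_a` drives both group rates to zero.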
The structural problem is that aggregate reward models compress preference distributions into single scalars (or single rankings) that cannot represent disagreement. They reward what the majority prefers and incidentally suppress what the minority prefers. For tasks with high consensus this is fine — the majority preference is everyone's preference. For tasks with genuine disagreement — subjective evaluations, value-laden topics, creative judgment, cultural-context-dependent choices — aggregate models systematically exclude the minority view.
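One way to see the compression is a small sketch that assumes a Bradley-Terry style reward model, in which the probability that A beats B is the logistic function of the reward gap. The note does not say which reward model family is meant, so this is an assumption; the helper names and the two toy populations are likewise hypothetical.

```python
import math

def aggregate_win_rate(groups):
    """groups: list of (population_share, probability a member of that group prefers A)."""
    return sum(share * p for share, p in groups)

def fitted_reward_gap(win_rate_a):
    """Maximum-likelihood Bradley-Terry gap r_A - r_B for an observed win rate of A over B."""
    return math.log(win_rate_a / (1.0 - win_rate_a))

# Two camps that each always choose their own answer (genuine disagreement).
polarized = [(0.51, 1.0), (0.49, 0.0)]

# One camp in which every user is close to a coin flip (near indifference).
indifferent = [(1.0, 0.51)]

gap_polarized = fitted_reward_gap(aggregate_win_rate(polarized))
gap_indifferent = fitted_reward_gap(aggregate_win_rate(indifferent))

print(gap_polarized, gap_indifferent)
```

Both populations produce the same aggregate win rate of 0.51 and therefore the same small positive reward gap. The fitted scalar cannot tell a polarized audience from an indifferent one, which is the sense in which aggregation discards the disagreement.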
This is not a quality problem with current reward models. It is a representational problem with the aggregation step itself. Even a perfect aggregate reward model would face this dilemma. The fix has to operate at a different level: reward models that can be specialized to individual users (or to user groups whose preferences cluster) rather than averaged across the population.
The implication extends beyond personalization. Whenever a system is deployed to a heterogeneous user base with genuinely divergent preferences, the standard "train one model to satisfy everyone" architecture cannot fully satisfy all of its users. The right architecture splits either per user (personalization) or per cluster (group-level adaptation). Aggregate reward modeling is appropriate only when the underlying preferences are actually unimodal, and that is a stronger assumption than RLHF deployments typically test.
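A crude first check of the unimodality assumption could look like the sketch below. It assumes that several raters labeled each prompt and that the labels are available as plain lists; the function names, the toy `ratings` data, and the 0.8 threshold are all hypothetical choices for illustration.

```python
from collections import Counter

def majority_share(labels):
    """Share of raters who chose the most common label for one prompt."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def contested_prompts(ratings, threshold=0.8):
    """Return ids of prompts whose majority share falls below `threshold`.

    ratings:   dict mapping prompt id -> list of rater labels.
    threshold: hypothetical cutoff; a real deployment would need to tune it.
    """
    return [
        prompt_id
        for prompt_id, labels in ratings.items()
        if labels and majority_share(labels) < threshold
    ]

# Toy data: one high-consensus prompt and one split prompt.
ratings = {
    "factual_q": ["A"] * 19 + ["B"],      # 95% agreement
    "taste_q": ["A"] * 11 + ["B"] * 9,    # 55% agreement
}

flagged = contested_prompts(ratings)
print(flagged)
```

A low majority share can also come from rater noise rather than stable disagreement, so a flag here is only a prompt for closer inspection, for example by checking whether the same raters split the same way across related prompts.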
Related concepts in this collection
- Does preference data need more raters than examples?
  Pairwise preference data violates the i.i.d. assumption because preferences vary across raters. Does this mean PAC bounds for reward models depend on rater diversity rather than just sample size?
  same paper, the theoretical foundation
- Does personalizing reward models amplify user echo chambers?
  Personalized reward models solve the minority-preference problem but may introduce new risks by reinforcing existing user beliefs and narrowing exposure to diverse viewpoints.
  same paper, the tension with personalization
- Can user preferences be learned from just ten questions?
  Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.
  adjacent: the technical solution to the aggregation problem
Original note title
aggregate reward models systematically exclude minority preferences — the dilemma of preferred answer or proportional sampling is a structural failure of one-size-fits-all RLHF