How do aggregate reward models systematically exclude minority preferences?

This explores why training a single reward model on pooled human preferences doesn't just average out minority views — it structurally erases them, and what the corpus offers as alternatives.

The core argument is a representational impossibility, not a data-quality bug. When a reward model is fit to aggregated preferences and people genuinely disagree, there is no single answer that serves everyone: a 51–49 split forces the system to either leave 49% unhappy every time or leave everyone unhappy half the time [Can aggregate reward models satisfy genuinely disagreeing users?]. The mechanism is averaging. Standard reward models assume one underlying utility function (the Bradley-Terry-Luce setup), so when preferences are actually multi-modal across groups, maximum-likelihood fitting lands on a centroid that optimizes nobody's utility, failing every subgroup rather than splitting the difference gracefully [Do unimodal reward models actually serve all user preferences?]. This has been proven formally: MaxMin-RLHF shows that one reward model fit to aggregated preferences silently erases minority viewpoints, and proposes learning a mixture of preference distributions and optimizing for the worst-off group, using ideas borrowed from social choice theory [Can a single reward model represent diverse human preferences?].
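A minimal sketch of the 51–49 case may make the averaging mechanism concrete. It assumes a toy setting that is not in the notes: two candidate answers A and B, 51 annotators who always prefer A, and 49 who always prefer B. The function names and the "satisfaction" measure are illustrative choices, not anything taken from the cited work.

```python
import math


def btl_gap_mle(wins_a: int, wins_b: int) -> float:
    """Two-item Bradley-Terry-Luce maximum-likelihood estimate of r_A - r_B.

    With P(A beats B) = sigmoid(r_A - r_B), the likelihood is maximized when
    the modeled win probability equals the empirical win rate.
    """
    p = wins_a / (wins_a + wins_b)
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def group_satisfaction(prob_serve_a: float) -> tuple[float, float]:
    """Fraction of the time each group receives its preferred answer
    (A-preferring group first, B-preferring group second)."""
    return prob_serve_a, 1.0 - prob_serve_a


# Hypothetical pooled preference data: the group label is discarded on pooling.
gap = btl_gap_mle(wins_a=51, wins_b=49)

# Horn 1: always serve the higher-reward answer.
greedy = group_satisfaction(1.0 if gap > 0 else 0.0)

# Horn 2: serve each answer in proportion to the fitted preference.
sampled = group_satisfaction(sigmoid(gap))

print(f"fitted reward gap r_A - r_B = {gap:.3f}")
print(f"greedy policy  (A-group, B-group) satisfaction = {greedy}")
print(f"sampled policy (A-group, B-group) satisfaction = {sampled}")
```

The fitted gap is a single small number, and nothing in it records that the two groups hold opposite preferences. Acting greedily on it serves one group every time, while sampling from it serves each group roughly half the time.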

The same dynamic shows up outside language models, which is a hint that the failure is about aggregation itself rather than anything specific to RLHF. Accuracy-optimized recommenders systematically over-weight a user's dominant interests and crowd out their minority tastes; the fix there is a post-hoc reranking step that re-imposes proportional representation without retraining [Why do accuracy-optimized recommenders crowd out minority interests?]. And large ranking systems converge on degenerate equilibria that amplify their own past choices unless selection bias is modeled explicitly [Why do ranking systems need to model selection bias explicitly?]. The recurring pattern: optimizing for an aggregate signal pulls toward the majority and treats the minority as noise to be smoothed away.
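The post-hoc reranking idea can be sketched as a greedy selection that trades relevance against a calibration penalty. This is one plausible formulation under stated assumptions, not necessarily the algorithm in the cited note. The KL-divergence penalty, the smoothing constant `alpha`, the trade-off weight `lam`, and the toy catalogue are all hypothetical.

```python
import math

Item = tuple[str, str, float]  # (title, genre, relevance score)


def genre_distribution(items: list[Item]) -> dict[str, float]:
    """Share of each genre in a recommendation list."""
    dist: dict[str, float] = {}
    for _, genre, _ in items:
        dist[genre] = dist.get(genre, 0.0) + 1.0 / len(items)
    return dist


def kl_divergence(target: dict[str, float], actual: dict[str, float],
                  alpha: float = 0.01) -> float:
    """KL(target || actual), with `actual` smoothed toward `target` so it is never zero."""
    total = 0.0
    for genre, p in target.items():
        if p > 0.0:
            q = (1.0 - alpha) * actual.get(genre, 0.0) + alpha * p
            total += p * math.log(p / q)
    return total


def calibrated_rerank(candidates: list[Item], target: dict[str, float],
                      k: int, lam: float = 0.5) -> list[Item]:
    """Greedily build a top-k list maximizing
    (1 - lam) * total relevance - lam * KL(target || list genre mix)."""
    chosen: list[Item] = []
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        def objective(item: Item) -> float:
            trial = chosen + [item]
            relevance = sum(score for _, _, score in trial)
            penalty = kl_divergence(target, genre_distribution(trial))
            return (1.0 - lam) * relevance - lam * penalty
        best = max(remaining, key=objective)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical user: 70% of history is action, 30% documentary.
target = {"action": 0.7, "documentary": 0.3}
candidates = [
    ("a1", "action", 0.90), ("a2", "action", 0.88), ("a3", "action", 0.86),
    ("a4", "action", 0.84), ("d1", "documentary", 0.60), ("d2", "documentary", 0.58),
]

top_by_score = sorted(candidates, key=lambda item: item[2], reverse=True)[:4]
calibrated = calibrated_rerank(candidates, target, k=4)

print("accuracy only:", [genre for _, genre, _ in top_by_score])
print("calibrated:   ", [genre for _, genre, _ in calibrated])
```

The underlying scores are never changed; only the final list is reordered. That is what lets this kind of fix run without retraining the model.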

Part of the problem is upstream, in what the preference data even measures. Annotation responses aren't a uniform "preference" signal: they decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, distinguishable by how consistent they are across conditions. Pooling them as if they're the same thing contaminates the reward model before any averaging even happens [Do all annotation responses measure the same underlying thing?]. So aggregation doesn't just lose minorities; it mixes real disagreement together with measurement noise and treats the blend as ground truth.

The corpus's proposed escape routes mostly point toward conditioning the reward on who is asking. VPL recovers the full multi-modal distribution using latent user context, so the model can be conditioned on a user group instead of collapsed to a centroid [Do unimodal reward models actually serve all user preferences?]. PReF goes further at inference time, representing each user's preferences as a personalized combination of base reward functions inferred from as few as ten adaptive questions, with no retraining required [Can user preferences be learned from just ten questions?]. But personalization is not a free lunch, and this is the thread you might not expect: removing the averaging effect also removes a safety rail. Per-user reward models can learn sycophancy and reinforce polarization at scale, reproducing exactly the echo-chamber failures recommender systems are already infamous for [Does personalizing reward models amplify user echo chambers?]. So the field sits on a genuine tension: aggregate models erase minorities, while personalized models can trap people in their own bubbles. The honest position is that you need explicit fairness objectives (the MaxMin worst-off-group framing) rather than naively swinging from one pole to the other.
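The "personalized combination of base reward functions" idea can be illustrated with a small sketch. It assumes a linear combination of two hypothetical base rewards (conciseness and thoroughness) and fits the user's weights by regularized logistic regression on ten pairwise answers. The questions here are fixed rather than adaptively chosen, so this shows only the weight-inference step, not PReF's question selection. All names and numbers are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def personalized_reward(weights: list[float], base_scores: list[float]) -> float:
    """User-specific reward: a weighted combination of shared base reward scores."""
    return sum(w * s for w, s in zip(weights, base_scores))


def infer_user_weights(answers, n_base: int, steps: int = 300,
                       lr: float = 0.5, l2: float = 0.01) -> list[float]:
    """Fit per-user weights by gradient ascent on the pairwise (BTL) log-likelihood.

    `answers` is a list of (chosen_scores, rejected_scores) pairs, each a vector
    of base reward scores. The base reward functions themselves stay frozen.
    """
    weights = [0.0] * n_base
    for _ in range(steps):
        grad = [-l2 * w for w in weights]  # L2 penalty keeps weights bounded
        for chosen, rejected in answers:
            diff = [c - r for c, r in zip(chosen, rejected)]
            p_chosen = sigmoid(personalized_reward(weights, diff))
            for i, d in enumerate(diff):
                grad[i] += (1.0 - p_chosen) * d / len(answers)
        weights = [w + lr * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical base reward scores: [conciseness, thoroughness].
# Ten answers from a user who always picks the more concise response.
answers = [
    ([0.9, 0.2], [0.3, 0.8]),
    ([0.8, 0.1], [0.2, 0.9]),
    ([0.7, 0.3], [0.4, 0.7]),
    ([0.9, 0.4], [0.5, 0.9]),
    ([0.6, 0.2], [0.1, 0.6]),
] * 2

weights = infer_user_weights(answers, n_base=2)

concise = [0.9, 0.1]
detailed = [0.1, 0.9]
print("inferred weights:", [round(w, 2) for w in weights])
print("prefers concise:", personalized_reward(weights, concise)
      > personalized_reward(weights, detailed))
```

Only the small weight vector is specific to the user, which is why this style of personalization can happen at inference time. It also shows where the echo-chamber risk enters: the weights follow whatever the user's answers reward, with no pooled signal to pull them back.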

One more reframing worth carrying away: the deepest version of the critique says the scalar reward is the wrong container in the first place. Human feedback actually carries two separable kinds of information, an evaluative signal (how good was that?) and a directive one (here's how it should change), and squeezing both into a single number throws the directional part away [Can scalar rewards capture all the information in agent feedback?]. Minority exclusion, on this view, is one symptom of a more general lossiness: a single scalar fit to a crowd can't hold disagreement, can't hold direction, and can't tell genuine preference apart from noise.

Sources (9 notes)

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Do unimodal reward models actually serve all user preferences?

Standard BTL reward models assume a single utility function, but when preferences are genuinely multi-modal across user groups, maximum-likelihood fitting produces a centroid policy that optimizes nobody's utility. VPL recovers multi-modal distributions using latent user context, enabling user-conditional reward modeling.

Can a single reward model represent diverse human preferences?

MaxMin-RLHF proves an impossibility result: fitting one reward model to aggregated preferences silently erases minority viewpoints. The solution is learning a mixture of preference distributions and optimizing a MaxMin objective from social choice theory to protect the worst-off groups.

Why do accuracy-optimized recommenders crowd out minority interests?

Accuracy-optimized models systematically miscalibrate by over-weighting dominant user interests. A post-processing reranking algorithm that enforces calibration constraints can restore proportional representation without retraining the underlying model.

Why do ranking systems need to model selection bias explicitly?

YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
