How do aggregate reward models systematically exclude minority perspectives?

This explores why training a single reward model on pooled human preferences doesn't just *underweight* minority views — it structurally erases them, and what the corpus offers as alternatives.

The cleanest way to see the problem is a thought experiment from the corpus: imagine users split 51-49 on what makes a good answer. A single aggregate reward model has to pick one winner, so it either leaves the 49% unhappy every single time, or it splits the difference and leaves *everyone* unhappy half the time (see "Can aggregate reward models satisfy genuinely disagreeing users?"). No setting of a single model satisfies people who genuinely disagree. This is a representational failure, not a tuning bug: averaging over disagreement doesn't find a consensus, it manufactures a fictional median user that no one actually is.
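The arithmetic behind the thought experiment can be sketched in a few lines. This is a toy model, not anything from the corpus: it assumes two groups with strictly opposed tastes, a satisfaction value of 1 or 0 per answer, and a hypothetical helper named `satisfaction`.

```python
def satisfaction(p_a, share_a=0.51):
    """Satisfaction rates when one model answers in style "a" with
    probability p_a and in style "b" otherwise.

    Assumption (illustrative): group A is satisfied only by style "a",
    group B only by style "b". Returns (rate_A, rate_B, population_average).
    """
    sat_a = p_a
    sat_b = 1.0 - p_a
    avg = share_a * sat_a + (1.0 - share_a) * sat_b
    return sat_a, sat_b, avg


# Option 1: always answer the way the 51% prefer.
majority_wins = satisfaction(1.0)

# Option 2: split the difference and alternate between styles.
split = satisfaction(0.5)

print("majority wins:", majority_wins)
print("split:", split)
```

Under these assumptions, the first option satisfies group A every time and group B never, while the second satisfies each group only half the time. No value of `p_a` pushes both rates above 0.5 at once, which is the representational limit the thought experiment points at.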

That intuition has been proven formally. MaxMin-RLHF shows that fitting one reward model to aggregated preferences provably cannot represent diverse populations equitably: the math guarantees minority viewpoints get silently absorbed into the majority signal (see "Can a single reward model represent diverse human preferences?"). The proposed escape borrows from social choice theory: learn a *mixture* of preference distributions rather than one blended reward, then optimize a MaxMin objective that explicitly protects the worst-off group instead of maximizing the average. The exclusion isn't an accident of bad data. It's baked into the act of collapsing many preferences into one scalar.
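To make the difference between the two objectives concrete, here is a minimal sketch that compares them over a fixed set of candidate policies. It skips the mixture-learning step entirely and assumes per-group rewards are already known; the policy names, reward values, and group weights are all invented for illustration and are not taken from MaxMin-RLHF.

```python
# Hypothetical per-group rewards for three candidate policies.
# Each tuple is (reward for majority group, reward for minority group).
group_rewards = {
    "serve_majority": (0.9, 0.1),
    "serve_minority": (0.2, 0.9),
    "compromise": (0.6, 0.6),
}

# Assumed population shares for the two groups.
group_weights = (0.8, 0.2)


def average_reward(rewards):
    """Population-weighted mean: what a pooled objective maximizes."""
    return sum(w * r for w, r in zip(group_weights, rewards))


def worst_group_reward(rewards):
    """Reward of the worst-off group: what a MaxMin objective maximizes."""
    return min(rewards)


best_avg = max(group_rewards, key=lambda name: average_reward(group_rewards[name]))
best_maxmin = max(group_rewards, key=lambda name: worst_group_reward(group_rewards[name]))

print("average objective picks:", best_avg)
print("MaxMin objective picks:", best_maxmin)
```

With these made-up numbers the average objective selects `serve_majority`, leaving the minority group at 0.1, while the MaxMin objective selects `compromise`, where neither group falls below 0.6.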

Here's the part that should give you pause: the obvious fix, personalizing the reward model per user so no one gets averaged away, backfires. Strip out the aggregate's averaging effect and you remove the thing that was dampening sycophancy, so systems learn to flatter each user and reinforce their existing views, recreating the polarization dynamics of recommender feeds (see "Does personalizing reward models amplify user echo chambers?"). So aggregation excludes minorities, but naive personalization manufactures echo chambers. The real design space lives between those two failure modes, not at either pole.

The same averaging pathology shows up far outside RLHF, which is the tell that this is structural. Accuracy-optimized recommenders systematically crowd out minority interests by over-weighting whatever dominates a user's history. Notably, the fix there isn't retraining but *post-hoc reranking* that enforces proportional representation as a calibration constraint (see "Why do accuracy-optimized recommenders crowd out minority interests?"). Ranking systems show the mechanism that makes it worse over time: without explicitly modeling selection bias, models converge on degenerate equilibria that amplify their own past decisions in a feedback loop (see "Why do ranking systems need to model selection bias explicitly?"). Minority exclusion isn't static: each training round trains on data shaped by the last round's majority bias.
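The post-hoc reranking idea can be illustrated with a small greedy sketch. This is not the algorithm from the cited note, which is only described as enforcing calibration constraints; it is a simple stand-in that fills each slot from whichever category is furthest below its target share. The function name `proportional_rerank`, the categories, and the scores are all hypothetical.

```python
def proportional_rerank(candidates, target_share, k):
    """Greedy proportional reranking (illustrative stand-in).

    candidates: list of (item, category, score) from an existing ranker.
    target_share: dict mapping category -> desired fraction of the list.
    k: length of the final list.
    """
    # Group candidates by category, best score first.
    pools = {}
    for item, category, score in sorted(candidates, key=lambda c: -c[2]):
        pools.setdefault(category, []).append((item, score))

    picked = []
    counts = {category: 0 for category in pools}
    for slot in range(1, k + 1):
        open_cats = [c for c in pools if pools[c]]
        if not open_cats:
            break
        # Deficit = target count at this slot minus items already placed.
        # Ties are broken by the category's best remaining score.
        best = max(
            open_cats,
            key=lambda c: (target_share.get(c, 0.0) * slot - counts[c], pools[c][0][1]),
        )
        item, _ = pools[best].pop(0)
        picked.append(item)
        counts[best] += 1
    return picked


# Hypothetical ranker output: the dominant category outscores the minority one.
candidates = [
    ("a1", "action", 0.95), ("a2", "action", 0.93),
    ("a3", "action", 0.91), ("a4", "action", 0.90),
    ("r1", "romance", 0.60), ("r2", "romance", 0.55),
]

accuracy_only = [item for item, _, _ in sorted(candidates, key=lambda c: -c[2])[:4]]
reranked = proportional_rerank(candidates, {"action": 0.75, "romance": 0.25}, k=4)

print("accuracy only:", accuracy_only)
print("reranked:", reranked)
```

In this toy case the accuracy-only list contains four action items and no romance, while the reranked list keeps one romance item in four, matching the assumed 75/25 interest split. The underlying scores are never changed, which mirrors the point that the correction happens after the model rather than through retraining.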

One deeper thread worth pulling: part of why a single reward model has to choose a winner is that it compresses everything into one number. Feedback actually carries two separable signals, *evaluative* (how good was this) and *directive* (how should it change), and a scalar reward keeps the first while discarding the second (see "Can scalar rewards capture all the information in agent feedback?"). A richer feedback representation that preserves the directional information might let a system hold multiple legitimate preferences open rather than forcing them into a single ranking, which suggests the exclusion problem is partly downstream of how thin the scalar-reward bottleneck is.
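A tiny sketch shows what the scalar bottleneck throws away. The `Feedback` type and `to_scalar` function below are hypothetical names chosen for illustration, and the feedback strings are invented.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    score: float     # evaluative signal: how good was this
    directive: str   # directive signal: how should it change


def to_scalar(feedback):
    """Collapse feedback to a single number, as a scalar reward does."""
    return feedback.score


# Two users rate the same answer equally but want opposite changes.
user_a = Feedback(score=0.4, directive="be more concise")
user_b = Feedback(score=0.4, directive="add more detail")

same_scalar = to_scalar(user_a) == to_scalar(user_b)
same_feedback = user_a == user_b

print("identical as scalars:", same_scalar)
print("identical as full feedback:", same_feedback)
```

The two feedback records are indistinguishable once reduced to a scalar, even though they ask for opposite changes. Any disagreement carried in the directive field is lost before the reward model ever sees it.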

Sources 6 notes

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Can a single reward model represent diverse human preferences?

MaxMin-RLHF proves an impossibility result: fitting one reward model to aggregated preferences silently erases minority viewpoints. The solution is learning a mixture of preference distributions and optimizing a MaxMin objective from social choice theory to protect the worst-off groups.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.

Why do accuracy-optimized recommenders crowd out minority interests?

Accuracy-optimized models systematically miscalibrate by over-weighting dominant user interests. A post-processing reranking algorithm that enforces calibration constraints can restore proportional representation without retraining the underlying model.

Why do ranking systems need to model selection bias explicitly?

YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
