Can a single reward model represent diverse human preferences?
Standard RLHF assumes one shared preference signal. But what happens when human values genuinely conflict? This question explores whether aggregating preferences into one model fundamentally fails at fairness.
Standard RLHF fits a single reward model to aggregated preference data, which silently assumes "one truth" in the crowd. MaxMin-RLHF first shows that this assumption fails, deriving an impossibility result: no single reward model can represent diverse human preferences. It then offers an equitable alternative: learn a mixture of preference distributions via expectation-maximization, then optimize a MaxMin policy objective inspired by the Egalitarian principle from social choice theory, which maximizes the welfare of the worst-off preference group. The authors connect this objective to distributionally robust optimization and general-utility RL, arguing that the approach is principled rather than ad hoc, and report gains in experiments on GPT-2 and Tulu2-7B.
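The mixture-learning step can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions made only for illustration: each response is reduced to one scalar feature `x`, each group's reward is linear (`r(x) = theta * x`), preferences follow a Bradley-Terry model, and the E-step uses hard assignments (each annotator goes to the single most likely group). The names `hard_em`, `fit_theta`, and the toy annotators are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, comparisons):
    """Bradley-Terry log-likelihood of (winner_x, loser_x) pairs under r(x) = theta * x."""
    return sum(math.log(sigmoid(theta * (xw - xl))) for xw, xl in comparisons)

def fit_theta(theta, comparisons, lr=0.5, steps=100):
    """Gradient ascent on the Bradley-Terry log-likelihood for one group's reward."""
    for _ in range(steps):
        grad = sum((1.0 - sigmoid(theta * (xw - xl))) * (xw - xl)
                   for xw, xl in comparisons)
        theta += lr * grad / len(comparisons)
    return theta

def hard_em(users, thetas, rounds=5):
    """Alternate between assigning annotators to groups and refitting group rewards."""
    assignment = {}
    for _ in range(rounds):
        # E-step (hard): each annotator joins the group whose reward best explains them.
        assignment = {
            name: max(range(len(thetas)),
                      key=lambda k: log_likelihood(thetas[k], comps))
            for name, comps in users.items()
        }
        # M-step: refit each group's reward on the comparisons of its members.
        for k in range(len(thetas)):
            pooled = [c for name, comps in users.items()
                      if assignment[name] == k for c in comps]
            if pooled:
                thetas[k] = fit_theta(thetas[k], pooled)
    return thetas, assignment

# Toy data (assumed): three annotators prefer the larger x, one prefers the smaller x.
pairs = [(0.9, 0.1), (0.7, 0.2), (0.8, 0.4)]           # (larger_x, smaller_x)
likes_large = [(hi, lo) for hi, lo in pairs]            # winner is the larger x
likes_small = [(lo, hi) for hi, lo in pairs]            # winner is the smaller x
users = {"a1": likes_large, "a2": likes_large, "a3": likes_large, "b1": likes_small}

thetas, assignment = hard_em(users, thetas=[0.5, -0.1])
print(assignment)
print(thetas)
```

On this toy data the lone dissenting annotator ends up in its own group with a reward of opposite sign, which is exactly the structure a single pooled reward model would average away.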
The keeper is the move from aggregation to social choice: once you accept that preferences are genuinely plural, the question is no longer "what does the average annotator want?" but "what allocation is fair across groups?" MaxMin/Egalitarian is one defensible answer that, by construction, refuses to sacrifice minorities to the majority.
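A toy calculation makes the contrast between the two questions concrete. The sketch below is an illustration only: the three candidate policies, their per-group expected rewards, and the 80/20 group shares are invented numbers, and the selection runs over a finite set of policies with no KL regularization, unlike a real RLHF objective.

```python
# Hypothetical expected reward of each candidate policy for each preference group.
expected_reward = {
    "majority_favoring": {"majority": 1.0, "minority": 0.1},
    "compromise":        {"majority": 0.7, "minority": 0.6},
    "minority_favoring": {"majority": 0.2, "minority": 1.0},
}
# Assumed share of annotators in each group.
group_share = {"majority": 0.8, "minority": 0.2}

def aggregate_score(rewards):
    """Population-weighted average reward: the 'average annotator' view."""
    return sum(group_share[g] * r for g, r in rewards.items())

def worst_group_score(rewards):
    """Reward of the worst-off group: the egalitarian (MaxMin) view."""
    return min(rewards.values())

aggregate_choice = max(expected_reward,
                       key=lambda p: aggregate_score(expected_reward[p]))
maxmin_choice = max(expected_reward,
                    key=lambda p: worst_group_score(expected_reward[p]))

print(aggregate_choice)  # majority_favoring
print(maxmin_choice)     # compromise
```

The weighted average scores are 0.82, 0.68, and 0.36, so aggregation picks the policy that leaves the minority at 0.1. The worst-group scores are 0.1, 0.6, and 0.2, so the MaxMin rule picks the compromise instead.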
This sharpens the vault's diverse-preferences cluster with a social-choice lens. It states formally what "Can aggregate reward models satisfy genuinely disagreeing users?" argues structurally, and it is a policy-objective sibling to the latent-variable approach of "Do unimodal reward models actually serve all user preferences?" (VPL): MaxMin changes the optimization target, VPL changes the reward model. Both presuppose "Are RLHF annotations actually measuring genuine human preferences?"
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why does single-reward RLHF fail to represent diverse human preferences?
- How do aggregate reward models systematically exclude minority perspectives?
- What validity threats exist in crowdsourced preference signals?
- How can developers balance multiple conflicting fairness goals simultaneously?
- How do aggregate reward models systematically exclude minority preferences?
- Why does preference measurement validity matter before any aggregation?
Related concepts in this collection (3)
- Can aggregate reward models satisfy genuinely disagreeing users?
  When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?
  MaxMin proves formally what this note argues structurally
- Do unimodal reward models actually serve all user preferences?
  Standard RLHF assumes a single utility function across all users, but what happens when preferences genuinely conflict? Does averaging these opposing preferences into one model systematically fail certain groups?
  sibling fix via latent-variable reward modeling; MaxMin fixes the objective, VPL the reward
- Are RLHF annotations actually measuring genuine human preferences?
  RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
  both diverse-preference fixes presuppose the preferences are validly measured in the first place
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- MaxMin-RLHF: Alignment with Diverse Human Preferences
- Measuring Human Preferences in RLHF is a Social Science Problem
- Capturing Individual Human Preferences with Reward Features
- Beyond Preferences in AI Alignment
- Self-Improving Model Steering
- Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
Original note title
single-reward RLHF is provably insufficient for diverse human preferences — a MaxMin egalitarian objective over a mixture of rewards is the social-choice fix