Can a single reward model represent diverse human preferences?

Standard RLHF assumes one shared preference signal. But what happens when human values genuinely conflict? This question explores whether aggregating preferences into one model fundamentally fails at fairness.

Synthesis note · 2026-06-03 · sourced from Reinforcement Learning

Standard RLHF fits a single reward model to aggregated preference data, which silently assumes the crowd shares "one truth". MaxMin-RLHF first shows that this assumption fails, via an impossibility result: a single reward model cannot represent diverse human preferences. It then offers an equitable alternative in two steps: learn a mixture of preference distributions via expectation-maximization, then optimize a MaxMin policy objective inspired by the Egalitarian principle from social choice theory (maximize the welfare of the worst-off preference group). The authors connect this to distributionally robust optimization and general-utility RL, arguing that the approach is principled rather than ad hoc, and report gains on GPT-2 and Tulu2-7B.
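The contrast between the two objectives can be written out as a rough sketch. The notation here is assumed for illustration and is not quoted from the paper: r is the single aggregated reward, r_u is the reward learned for preference group u in a set of groups U (for example, the mixture components recovered by EM), π_ref is the reference policy, and β is the KL penalty weight.

```latex
% Conventional KL-regularized RLHF: one reward r for everyone
\pi^{*}_{\mathrm{single}}
  = \arg\max_{\pi}\;
    \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
    - \beta\, \mathrm{KL}\big( \pi \,\|\, \pi_{\mathrm{ref}} \big)

% MaxMin (Egalitarian) variant: maximize the reward of the worst-off group
\pi^{*}_{\mathrm{MaxMin}}
  = \arg\max_{\pi}\; \min_{u \in \mathcal{U}}\;
    \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r_{u}(x, y) \big]
    - \beta\, \mathrm{KL}\big( \pi \,\|\, \pi_{\mathrm{ref}} \big)
```

The only structural change is the inner minimum over groups: the policy is scored by the group it serves worst rather than by a pooled reward.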

The keeper is the move from aggregation to social choice: once you accept preferences are genuinely plural, the question is not "what does the average annotator want?" but "what allocation is fair across groups?" — and MaxMin/Egalitarian is one defensible answer that, by construction, refuses to sacrifice minorities to the majority.
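A toy calculation can make the difference between the two questions concrete. This is a minimal sketch with invented numbers, not the paper's training procedure: the policies, group shares, and reward values are all hypothetical, and the function names `utilitarian_choice` and `maxmin_choice` are introduced here only for illustration.

```python
import numpy as np

# Hypothetical expected rewards: rows are candidate policies, columns are
# preference groups (majority, minority). All values are invented.
rewards = np.array([
    [0.9, 0.1],  # policy A: great for the majority, poor for the minority
    [0.6, 0.5],  # policy B: decent for both groups
    [0.2, 0.8],  # policy C: favours the minority
])

# Assumed share of annotators belonging to each group.
group_weights = np.array([0.8, 0.2])


def utilitarian_choice(rewards: np.ndarray, weights: np.ndarray) -> int:
    """Index of the policy with the highest population-weighted average reward."""
    return int(np.argmax(rewards @ weights))


def maxmin_choice(rewards: np.ndarray) -> int:
    """Index of the policy whose worst-off group receives the highest reward."""
    return int(np.argmax(rewards.min(axis=1)))


if __name__ == "__main__":
    names = ["A", "B", "C"]
    print("average-annotator pick:", names[utilitarian_choice(rewards, group_weights)])
    print("worst-off-group pick:  ", names[maxmin_choice(rewards)])
```

Under these made-up numbers the weighted average favours policy A, which leaves the minority group at 0.1, while the MaxMin rule selects policy B, whose worst-served group still gets 0.5. The real method applies the same selection principle to learned group rewards during policy optimization rather than to a fixed table.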

This sharpens the vault's diverse-preferences cluster with a social-choice lens. It states formally what "Can aggregate reward models satisfy genuinely disagreeing users?" argues structurally, and it is a policy-objective sibling to the latent-variable approach (VPL) of "Do unimodal reward models actually serve all user preferences?": MaxMin changes the optimization target, VPL changes the reward model. Both presuppose that the preferences are validly measured in the first place, which is the question raised in "Are RLHF annotations actually measuring genuine human preferences?".

Related concepts in this collection 3

Can aggregate reward models satisfy genuinely disagreeing users? When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?
MaxMin proves formally what this note argues structurally
Do unimodal reward models actually serve all user preferences? Standard RLHF assumes a single utility function across all users, but what happens when preferences genuinely conflict? Does averaging these opposing preferences into one model systematically fail certain groups?
sibling fix via latent-variable reward modeling; MaxMin fixes the objective, VPL the reward
Are RLHF annotations actually measuring genuine human preferences? RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
both diverse-preference fixes presuppose the preferences are validly measured in the first place
