SYNTHESIS NOTE

Can a single reward model represent diverse human preferences?

Standard RLHF assumes one shared preference signal. But what happens when human values genuinely conflict? This question explores whether aggregating preferences into one model fundamentally fails at fairness.

Synthesis note · 2026-06-03 · sourced from Reinforcement Learning

Standard RLHF fits a single reward model to aggregated preference data, which silently assumes there is "one truth" in the crowd. MaxMin-RLHF first shows that this assumption fails: it gives an impossibility result for aligning to diverse human preferences with a single reward. It then offers an equitable alternative: learn a mixture of preference distributions via expectation-maximization, and optimize a MaxMin policy objective inspired by the Egalitarian principle from social choice theory (maximize the welfare of the worst-off preference group). The authors connect this to distributionally robust optimization and general-utility RL, which grounds the approach in established frameworks rather than leaving it ad hoc, and they report gains in experiments on GPT-2 and Tulu2-7B.
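
To make the contrast concrete, the two objectives can be written side by side. This is a sketch in the usual KL-regularized RLHF form; the notation (K preference groups, per-group rewards r_u, reference policy \pi_ref, regularization weight \beta, prompt distribution D) is assumed here for illustration and is not quoted from the paper.

```latex
% Single-reward RLHF: one aggregated reward r for everyone.
\pi^{\star}_{\text{single}} \in \arg\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)

% MaxMin (Egalitarian) objective: optimize for the worst-off of K groups,
% where each r_u is the reward learned for mixture component u.
\pi^{\star}_{\text{maxmin}} \in \arg\max_{\pi}\;
  \min_{u \in \{1, \dots, K\}}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r_u(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
```

The only structural change is the inner minimum over groups, which is what links the objective to distributionally robust optimization.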

The keeper is the move from aggregation to social choice: once you accept preferences are genuinely plural, the question is not "what does the average annotator want?" but "what allocation is fair across groups?" — and MaxMin/Egalitarian is one defensible answer that, by construction, refuses to sacrifice minorities to the majority.
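
A toy calculation illustrates the difference between the two questions. The numbers, group names, and helper functions below are hypothetical and chosen only for illustration; this is a minimal sketch of choosing among three fixed responses, not the paper's training procedure.

```python
# Hypothetical toy data (not from the paper): the reward each preference
# group assigns to three candidate responses A, B, C.
REWARDS = {
    "majority": [1.0, 0.7, 0.0],
    "minority": [0.0, 0.6, 1.0],
}
# Assumed share of each group among annotators.
SHARES = {"majority": 0.8, "minority": 0.2}
RESPONSES = ["A", "B", "C"]


def worst_off(rewards, i):
    """Reward that the least-satisfied group receives from response i."""
    return min(rewards[g][i] for g in rewards)


def average_choice(rewards, shares):
    """Index of the response with the highest share-weighted mean reward."""
    def mean_reward(i):
        return sum(shares[g] * rewards[g][i] for g in rewards)
    return max(range(len(RESPONSES)), key=mean_reward)


def maxmin_choice(rewards):
    """Index of the response whose worst-off group reward is highest."""
    return max(range(len(RESPONSES)), key=lambda i: worst_off(rewards, i))


avg_idx = average_choice(REWARDS, SHARES)
mm_idx = maxmin_choice(REWARDS)

print(RESPONSES[avg_idx], worst_off(REWARDS, avg_idx))  # prints: A 0.0
print(RESPONSES[mm_idx], worst_off(REWARDS, mm_idx))    # prints: B 0.6
```

Under these assumed numbers, the share-weighted average picks A and leaves the minority group with zero reward, while the MaxMin rule picks B, which neither group ranks first but both find acceptable.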

This sharpens the vault's diverse-preferences cluster with a social-choice lens. It states formally what "Can aggregate reward models satisfy genuinely disagreeing users?" argues structurally, and it is a policy-objective sibling to the latent-variable approach of "Do unimodal reward models actually serve all user preferences?" (VPL): MaxMin changes the optimization target, while VPL changes the reward model. Both presuppose an answer to "Are RLHF annotations actually measuring genuine human preferences?"

Original note title

single-reward RLHF is provably insufficient for diverse human preferences — a MaxMin egalitarian objective over a mixture of rewards is the social-choice fix