Do unimodal reward models actually serve all user preferences?

Standard RLHF assumes a single utility function across all users, but what happens when preferences genuinely conflict? Does averaging these opposing preferences into one model systematically fail certain groups?

Note · 2026-05-18 · sourced from Reinforcement Learning

The dominant RLHF formulation assumes all human preferences derive from a single utility function. Under this assumption, the pipeline applies the Bradley-Terry-Luce (BTL) model, fits one reward function to the aggregated preference data, and optimizes the policy against it. The model class is unimodal.
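
For reference, the single-utility setup described above can be written out as follows. The notation is assumed here for illustration and does not come from the note: r_phi is the shared reward model, s_A and s_B are the two compared responses, y = 1 means A was preferred, and D is the pooled preference dataset.

```latex
P(s_A \succ s_B) = \sigma\big(r_\phi(s_A) - r_\phi(s_B)\big),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}

\mathcal{L}(\phi) = -\,\mathbb{E}_{(s_A, s_B, y) \sim \mathcal{D}}
\Big[\, y \log \sigma\big(r_\phi(s_A) - r_\phi(s_B)\big)
      + (1 - y) \log \sigma\big(r_\phi(s_B) - r_\phi(s_A)\big) \Big]
```

Nothing in this objective indexes the user: every comparison in D, whoever produced it, pulls on the same r_phi.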

This breaks when human preferences are genuinely multi-modal — when different groups of users derive opposing utilities from the same response attributes. The classic case: one group prefers detailed responses, another prefers concise ones. Maximum-likelihood estimation under unimodal BTL learns a reward function that averages these preferences. The resulting policy is optimized to a centroid that maximizes nobody's utility. Each subgroup is systematically failed.
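
The averaging effect can be reproduced in a toy simulation. This is a minimal sketch under assumed conditions, not an experiment from the paper: each response is reduced to one scalar "detail" feature, the two groups have true utilities +2 and -2 times that feature, and the reward is a one-parameter linear model. All function and variable names are hypothetical.

```python
import numpy as np

def sample_comparisons(w_true, n, rng):
    """Simulate n BTL comparisons for users whose utility is w_true * detail."""
    detail_a = rng.normal(size=n)
    detail_b = rng.normal(size=n)
    diff = detail_a - detail_b
    p_a_wins = 1.0 / (1.0 + np.exp(-w_true * diff))
    y = (rng.random(n) < p_a_wins).astype(float)  # 1.0 if response A is preferred
    return diff, y

def fit_btl_weight(diff, y, steps=500, lr=0.5):
    """Maximum-likelihood fit of the reward r(x) = w * x by gradient ascent."""
    w = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-w * diff))
        w += lr * np.mean((y - p) * diff)  # gradient of the mean log-likelihood
    return float(w)

rng = np.random.default_rng(0)
diff_detail, y_detail = sample_comparisons(+2.0, 5000, rng)    # group that likes detail
diff_concise, y_concise = sample_comparisons(-2.0, 5000, rng)  # group that likes brevity

w_detail = fit_btl_weight(diff_detail, y_detail)
w_concise = fit_btl_weight(diff_concise, y_concise)
w_pooled = fit_btl_weight(
    np.concatenate([diff_detail, diff_concise]),
    np.concatenate([y_detail, y_concise]),
)

print(f"per-group fits: detail={w_detail:.2f}, concise={w_concise:.2f}")
print(f"pooled fit:     {w_pooled:.2f}")
```

Fitted separately, each group recovers a weight near its true value. Fitted on the pooled data, the weight lands near zero, so the learned reward barely distinguishes detailed from concise responses and gives the policy no usable signal for either group.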

Variational Preference Learning (VPL, arXiv:2408.10075) treats this as a latent-variable problem. Each user's preferences are generated from a latent context z, the user's hidden type, and the reward function is conditioned on z. Given a few preference annotations from a user, a variational encoder infers a posterior over z, and the reward model then makes user-specific predictions. The approach is principled under variational inference: an ELBO can be derived for latent-variable preference-based reward optimization.
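
A generic form of such a bound is sketched below. The notation is assumed and may differ from the paper's: C is the small set of comparisons already annotated by the user, q_psi is the variational encoder, p(z) is a prior over user types, and r_phi is the z-conditioned reward. Any weighting on the KL term is omitted.

```latex
p_\phi(y = 1 \mid s_A, s_B, z) = \sigma\big(r_\phi(s_A, z) - r_\phi(s_B, z)\big)

\log p_\phi(y \mid s_A, s_B) \;\ge\;
\mathbb{E}_{z \sim q_\psi(z \mid \mathcal{C})}
\big[\log p_\phi(y \mid s_A, s_B, z)\big]
\;-\; D_{\mathrm{KL}}\big(q_\psi(z \mid \mathcal{C}) \,\|\, p(z)\big)
```

The first term rewards user-specific predictions that match the observed labels; the second keeps the inferred user type close to the prior, which matters when only a few annotations are available.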

Two technical considerations emerge. First, binary comparisons inherently lack information about reward scale: they constrain only the difference r_A − r_B. Rewards inferred for different users can therefore end up with vastly different magnitudes, which destabilizes multi-user RL. VPL addresses this with a simple pairwise classification scheme that bounds and scales reward estimates within the latent-variable framework. Second, the variational structure provides predictive uncertainty over the user's latent, which supports active learning (choosing which comparisons to ask a user next) and abstention (declining to predict when the posterior is too uncertain).
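
The scale problem and one possible remedy can be illustrated numerically. The note does not spell out the exact estimator, so the `bounded_reward` function below is an assumption: it scores a response by its average predicted probability of beating a set of reference responses, which is one plausible reading of a "pairwise classification" scaling. All names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def btl_log_likelihood(r_a, r_b, y):
    """Log-likelihood of binary labels y (1 = A preferred) under BTL."""
    p = sigmoid(r_a - r_b)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def bounded_reward(raw, reference_raw):
    """Assumed scaling: mean predicted probability of beating each reference."""
    return sigmoid(raw[:, None] - reference_raw[None, :]).mean(axis=1)

# 1) The data cannot see a constant offset: only r_A - r_B enters the likelihood.
r_a = np.array([0.3, -1.2, 2.0])
r_b = np.array([1.0, 0.4, -0.5])
y = np.array([0.0, 0.0, 1.0])
ll = btl_log_likelihood(r_a, r_b, y)
ll_shifted = btl_log_likelihood(r_a + 100.0, r_b + 100.0, y)

# 2) Two users with the same ranking but very different raw magnitudes.
raw = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
user_small = bounded_reward(1.0 * raw, 1.0 * raw)
user_large = bounded_reward(20.0 * raw + 7.0, 20.0 * raw + 7.0)

print("log-likelihoods:", ll, ll_shifted)
print("bounded rewards, small-scale user:", np.round(user_small, 3))
print("bounded rewards, large-scale user:", np.round(user_large, 3))
```

Both users' rescaled rewards fall in the unit interval and keep their original ordering, so a downstream RL algorithm sees comparable magnitudes regardless of how large the raw reward values were.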

The conceptual move matters beyond reward models. The unimodal assumption is doing more work than it appears. Across many RLHF deployments, "preference data" silently aggregates conflicting utilities, and the resulting policy is systematically miscalibrated for every subgroup, not just one. The averaging is not a smoothing operation that degrades gracefully; it is an actively wrong specification that produces a policy nobody wants.

This connects directly to the note "Can text summaries beat embeddings for personalized reward models?": PLUS replaces the embedding latent with a text summary, achieving the same conditioning effect while adding interpretability and portability. VPL is the variational baseline that PLUS improves upon by switching the latent representation from a vector to text.

Pluralistic alignment is not a refinement of RLHF; it is the correction of a categorically misspecified assumption.


Paper: Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Original note title

unimodal BTL reward models average across multi-modal preferences and produce policies that fail every subgroup