Does preference data need more raters than examples?
Pairwise preference data violates the i.i.d. assumption because preferences vary across raters. Does this mean PAC bounds for reward models depend on rater diversity rather than just sample size?
Standard PAC learning theory assumes training data is independently and identically distributed. Reward models trained on aggregated human preferences quietly violate this assumption: examples come from raters whose preferences differ systematically, so the pooled data is not i.i.d. across raters even if it appears i.i.d. within each rater. The paper "Capturing Individual Human Preferences with Reward Features" derives the resulting PAC bound and shows that it has a different shape from the standard one: approximation error depends on the number of raters who provided feedback, not just the number of examples.
This is the theoretical foundation that empirical reward-factorization work like PReF lacked. PReF showed that 10 to 20 active-learning queries suffice for per-user personalization given a base set of reward features, but its explanation for that result was operational rather than formal. The PAC bound provides the formal account: when each user's reward is a linear combination of features learned from group data, the generalization error for a new user decomposes into a term that depends on examples per rater and a separate term that depends on how many raters contributed to feature learning. Both terms matter, and both can be reduced through data collection.
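A schematic way to write the two-term decomposition described above. The symbols are assumptions of this note, not the paper's notation: n is the number of examples per rater, k is the number of raters used for feature learning, and f and g are unspecified decreasing functions standing in for the actual rates. The block assumes the amsmath package for \text.

```latex
% Schematic only: f and g are placeholders, not the rates derived in the paper.
\[
  \mathrm{err}(\text{new user})
  \;\le\;
  \underbrace{f(n)}_{\text{examples per rater}}
  \;+\;
  \underbrace{g(k)}_{\text{raters used for feature learning}}
\]
```

Read this way, increasing n alone leaves g(k) unchanged, which is the sense in which sample size by itself undercounts the relevant axis.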
The methodological consequence is sharp. Standard practice in RLHF data collection optimizes for example count: more pairwise preferences per rater, and more raters annotating the same examples for inter-rater reliability. The PAC bound argues for a different allocation. On high-disagreement tasks such as creative writing, subjective evaluation, and value-laden topics, more raters with fewer examples each beats fewer raters with many examples each. The features needed to span the preference space require diversity along the rater axis, not just depth along the example axis.
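The allocation argument can be illustrated with a toy simulation. This is a minimal sketch under assumptions that are not taken from the paper: user reward vectors lie in a shared low-dimensional subspace, each rater's vector is estimated with Gaussian noise that shrinks as 1/sqrt(n), and features are learned with a plain SVD rather than from pairwise comparisons. The function name `captured_fraction` and all of its parameters are hypothetical.

```python
import numpy as np


def captured_fraction(num_raters, examples_per_rater, *, dim=20, rank=4,
                      noise_scale=1.0, num_test_users=200, seed=0):
    """Average share of a new user's reward vector spanned by learned features."""
    rng = np.random.default_rng(seed)

    # Assumed ground truth: every user's reward weights lie in a shared
    # `rank`-dimensional subspace of R^dim with orthonormal basis `basis`.
    basis, _ = np.linalg.qr(rng.standard_normal((dim, rank)))

    # Assumed noise model: each rater's weights are recovered with Gaussian
    # error whose scale shrinks as 1 / sqrt(examples_per_rater).
    rater_weights = rng.standard_normal((num_raters, rank)) @ basis.T
    noise = rng.standard_normal((num_raters, dim))
    estimates = rater_weights + noise * noise_scale / np.sqrt(examples_per_rater)

    # Feature learning: top right singular vectors of the stacked estimates.
    # When num_raters < rank, only num_raters directions are available.
    _, _, vt = np.linalg.svd(estimates, full_matrices=False)
    features = vt[:rank]

    # Project unseen users onto the learned features and measure how much
    # of each reward vector's squared norm survives the projection.
    test_weights = rng.standard_normal((num_test_users, rank)) @ basis.T
    projected = test_weights @ features.T
    share = np.sum(projected**2, axis=1) / np.sum(test_weights**2, axis=1)
    return float(np.mean(share))


budget = 1000  # total comparisons, split two different ways
few_deep = captured_fraction(num_raters=2, examples_per_rater=budget // 2)
many_shallow = captured_fraction(num_raters=50, examples_per_rater=budget // 50)
print(f"2 raters x 500 examples: {few_deep:.2f}")
print(f"50 raters x 20 examples: {many_shallow:.2f}")
```

With two raters the learned feature set has at most two directions, so it cannot span a four-dimensional preference space however many examples each rater contributes. One minus the returned value is the share of a new user's reward that the learned features cannot represent.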
For builders, this changes how reward-model data collection should be structured for personalization. Generic single-distribution reward models can be trained with concentrated rater pools. Adaptive reward models need broad rater pools and structured feature-learning even at lower per-rater example counts.
Related concepts in this collection
- Can user preferences be learned from just ten questions?
  Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.
  same conceptual framework: this note provides the theoretical PAC foundation that PReF's empirical efficiency demonstrates
- Can aggregate reward models satisfy genuinely disagreeing users?
  When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?
  same paper, the consequence of treating preferences as i.i.d.
- Can text summaries beat embeddings for personalized reward models?
  When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
  adjacent: a different mechanism for personalized alignment
Original note title
PAC bound for personalized reward models depends on number of raters not just number of examples — preference data is not iid so traditional sample-complexity bounds undercount the relevant axis