Recommender Systems

Does preference data need more raters than examples?

Pairwise preference data violates the i.i.d. assumption because preferences vary across raters. Does this mean PAC bounds for reward models depend on rater diversity rather than just sample size?

Note · 2026-05-18 · sourced from Recommenders Personalized

Standard PAC learning theory assumes training data is independently and identically distributed (i.i.d.). Reward models trained on aggregated human preferences quietly violate this assumption: examples come from raters whose preferences differ systematically, so the pooled data is not i.i.d. across raters even if it looks i.i.d. within each rater. "Capturing Individual Human Preferences with Reward Features" derives the resulting PAC bound and shows that it has a different shape from the standard one: approximation error depends on the number of raters who provided feedback, not just the number of examples.

This is the theoretical foundation that empirical reward-factorization work such as PReF lacked. PReF showed that 10-20 active-learning queries suffice for per-user personalization given a base set of reward features, but its explanation of why was operational rather than formal. The PAC bound supplies the formal account: when each user's reward is a linear combination of reward features learned from group data, the generalization error for a new user decomposes into one term that depends on examples per rater and a separate term that depends on how many raters contributed to feature learning. Both terms matter, and both can be optimized.
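To make the per-user step concrete, here is a minimal sketch of fitting one new user's weights over a fixed set of reward features from about 20 pairwise comparisons. It assumes a Bradley-Terry preference model and plain gradient ascent; the random feature vectors stand in for learned reward features, the queries are random rather than actively selected, and the names `fit_user_weights` and `mean_log_likelihood` are hypothetical. It is not the procedure from PReF or from the paper.

```python
import numpy as np

def fit_user_weights(phi_a, phi_b, prefers_a, lr=0.1, steps=500):
    """Fit one user's weights w, assuming reward(x) = w @ phi(x) and
    Bradley-Terry preferences: P(a preferred over b) = sigmoid(r(a) - r(b))."""
    signs = np.where(prefers_a, 1.0, -1.0)       # +1 if a was preferred, else -1
    signed = (phi_a - phi_b) * signs[:, None]    # orient each pair toward the winner
    w = np.zeros(signed.shape[1])
    for _ in range(steps):
        margins = signed @ w
        # Gradient of the mean log-sigmoid(margin): sigmoid(-margin) * signed
        grad = ((1.0 / (1.0 + np.exp(margins)))[:, None] * signed).mean(axis=0)
        w += lr * grad
    return w

def mean_log_likelihood(w, phi_a, phi_b, prefers_a):
    """Average Bradley-Terry log-likelihood of the observed preferences."""
    signs = np.where(prefers_a, 1.0, -1.0)
    margins = signs * ((phi_a - phi_b) @ w)
    return float(np.mean(-np.logaddexp(0.0, -margins)))

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n_features, n_queries = 3, 20
w_true = rng.normal(size=n_features)             # the new user's hidden weights
phi_a = rng.normal(size=(n_queries, n_features)) # features of option a in each pair
phi_b = rng.normal(size=(n_queries, n_features)) # features of option b in each pair
prefers_a = (phi_a - phi_b) @ w_true > 0         # noiseless preference labels

w_hat = fit_user_weights(phi_a, phi_b, prefers_a)
print("alignment with true weights:", float(w_hat @ w_true))
```

The sketch only covers the examples-per-rater side of the decomposition: it takes the features as given. How good those features are is governed by the other term, which depends on how many raters contributed to learning them.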

The methodological consequence is sharp. Standard practice in RLHF data collection optimizes for example count: more pairwise preferences per rater, or more raters annotating the same examples for inter-rater reliability. The PAC bound argues for a different allocation. On high-disagreement tasks (creative writing, subjective evaluation, value-laden topics), more raters with fewer examples each beats fewer raters with many examples each. Learning features that span the preference space requires diversity along the rater axis, not just depth along the example axis.
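A toy linear-algebra illustration of the spanning argument, under the assumption that each rater's preferences are a weight vector in a latent preference space of dimension `latent_dim` (a hypothetical setup, not the paper's bound): the number of preference directions a rater pool can expose is capped by the number of raters, no matter how many examples each rater labels.

```python
import numpy as np

def spanned_dimension(n_raters, latent_dim, rng):
    """Rank of the rater weight matrix, i.e. how many independent preference
    directions this pool exposes. Per-rater example count never enters."""
    rater_weights = rng.normal(size=(n_raters, latent_dim))
    return int(np.linalg.matrix_rank(rater_weights))

rng = np.random.default_rng(0)
latent_dim = 5  # assumed dimension of the preference space

# Two allocations of the same 3000-comparison annotation budget.
deep = {"raters": 3, "per_rater": 1000}
broad = {"raters": 100, "per_rater": 30}
assert deep["raters"] * deep["per_rater"] == broad["raters"] * broad["per_rater"]

deep_rank = spanned_dimension(deep["raters"], latent_dim, rng)
broad_rank = spanned_dimension(broad["raters"], latent_dim, rng)
print("deep pool spans", deep_rank, "of", latent_dim, "directions")
print("broad pool spans", broad_rank, "of", latent_dim, "directions")
```

This ignores estimation noise entirely, so it shows only the identifiability ceiling: three raters cannot reveal more than three directions however deeply they are sampled, while the broad pool can cover all five.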

For builders, this changes how reward-model data collection should be structured for personalization. Generic single-distribution reward models can be trained with concentrated rater pools. Adaptive reward models need broad rater pools and structured feature learning, even at lower per-rater example counts.

Original note title

PAC bound for personalized reward models depends on number of raters not just number of examples — preference data is not iid so traditional sample-complexity bounds undercount the relevant axis