Capturing Individual Human Preferences with Reward Features

Paper · arXiv 2503.17338

Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing the dependency of the approximation error on the number of training examples, as usual, and also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users.

We need reward models that can be specialised to users. Crucially, we want the model to be able to adapt to people outside of the group who provided feedback for training. In this paper we formalise and analyse the problem of learning a reward model with this ability. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound that shows how the approximation error depends not only on the number of training examples, but also on the number of human raters who provided feedback on them. This is an interesting problem from a theoretical standpoint because the data involved is not independent and identically distributed (i.i.d.). The resulting analysis provides a formal framework to discuss strategies for pairwise preference data collection and to assess the trade-offs associated with the use of an adaptive reward model.

Our model leverages the fact that individual preferences can be captured as a linear combination of a set of general reward features. These features are unknown (in some cases to the person themself), so we need a systematic way to extract them from the data. We show how this can be accomplished using a very simple architecture and training. During the regular RLHF process, we use data coming from different individuals to learn common features that capture the preferences of the group. When the reward model is being specialised to an unknown user, the features are frozen, and only the coefficients of the linear combination must be learned. This results in a simple classification problem that can be reliably solved with a few training examples provided by the user.
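The adaptation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frozen reward features are replaced by random vectors, the user is simulated with a hypothetical weight vector `true_w`, and preferences are assumed to follow a logistic (Bradley-Terry style) model on the difference of rewards. The names `fit_user_weights`, `phi_a`, `phi_b` and all hyperparameters are assumptions introduced for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_user_weights(phi_a, phi_b, prefers_a, lr=0.5, steps=500, l2=1e-3):
    """Learn the coefficients w of one user's linear reward r(x) = w . phi(x).

    The features phi are treated as frozen; only w is trained, by gradient
    descent on the logistic loss of the assumed preference model
    P(a preferred over b) = sigmoid(w . (phi(a) - phi(b))).
    """
    diff = phi_a - phi_b                      # (n_pairs, n_features)
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = sigmoid(diff @ w)                 # predicted P(a preferred)
        grad = diff.T @ (p - prefers_a) / len(prefers_a) + l2 * w
        w -= lr * grad
    return w


# Synthetic stand-ins (assumptions): 30 response pairs, 4 frozen features.
rng = np.random.default_rng(0)
n_pairs, n_features = 30, 4
phi_a = rng.normal(size=(n_pairs, n_features))
phi_b = rng.normal(size=(n_pairs, n_features))

# Hypothetical user whose choices are consistent with a linear reward.
true_w = np.array([1.0, -2.0, 0.5, 0.0])
prefers_a = ((phi_a - phi_b) @ true_w > 0).astype(float)

w_hat = fit_user_weights(phi_a, phi_b, prefers_a)

# Fraction of the user's pairs on which the adapted reward agrees with them.
agreement = np.mean(((phi_a - phi_b) @ w_hat > 0) == (prefers_a == 1.0))
print(f"agreement with the user's choices: {agreement:.2f}")
```

Because only the few coefficients in `w` are trained, the problem stays small regardless of the size of the network that produces the features, which is why a handful of labelled pairs from the user can be enough.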

As an illustration, suppose that, given a pair of alternative responses to a subjective question, 51% of the target audience prefer the first option while the remaining 49% prefer the second. If we do not distinguish between users, we are left with two options: either we pick the preferred answer and leave 49% of the users unhappy 100% of the time, or we sample the answers proportionally to how often they are preferred and leave 100% of the users unhappy approximately half of the time. Both solutions are clearly unsatisfactory. Generalising to unseen users is challenging because it requires one to capture the subjective criteria that underlie human preferences—a non-trivial endeavour, as humans often cannot fully articulate the reasons why they prefer one behaviour over the other. A possible approach is to leverage precisely the type of pairwise preference data used in RLHF, since it allows the preference criteria to be indirectly elicited rather than explicitly spelled out.
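The arithmetic behind the 51%/49% illustration can be checked directly. The short sketch below uses hypothetical function names and assumes a user is "unhappy" exactly when shown the response they did not prefer.

```python
import math


def unhappy_majority_pick(p_first):
    """Always return the majority answer (assumes p_first >= 0.5).

    Returns (share of users who are unhappy, how often those users are unhappy).
    """
    return 1.0 - p_first, 1.0


def unhappy_proportional_sampling(p_first):
    """Sample each answer in proportion to how often it is preferred.

    Returns the per-interaction probability of being unhappy for a user who
    prefers the first answer and for a user who prefers the second.
    """
    return 1.0 - p_first, p_first


p_first = 0.51
share_unhappy, how_often = unhappy_majority_pick(p_first)
first_group, second_group = unhappy_proportional_sampling(p_first)

print(f"majority pick: {share_unhappy:.0%} of users unhappy {how_often:.0%} of the time")
print(f"sampling: first group unhappy {first_group:.0%}, second group {second_group:.0%} of the time")
```

Under sampling, every user is shown the response they did not prefer roughly half of the time, which is the second unsatisfactory outcome described above.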

Current reward models reflect the average preferences of the target population, which excludes under-represented or "outlier" preferences. In using specialised reward models to adapt LLM outputs to individuals, we create systems that can more accurately and reliably reflect the perspectives of users who hold minority views, potentially empowering them, together with everyone else, to participate more fully in social debate. However, the personalisation of LLMs should be part of, and would directly benefit from, the wider ongoing discussion regarding the deployment of this new technology. If not implemented with ethical implications in mind, the specialisation of LLMs to user preferences may result in models behaving in undesirable ways, reinforcing existing points of view through sycophantic behaviour, contributing to the polarisation of opinions and the creation of "echo chambers."

