Information-Theoretic Reward Decomposition for Generalizable RLHF

Paper · arXiv 2504.06020 · Published April 8, 2025
Reward Models · Reinforcement Learning

A generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF), as it enables correct evaluation of unseen prompt-response pairs. However, existing reward models can lack this ability: they are typically trained by increasing the reward gap between chosen and rejected responses while overlooking the prompts that the responses are conditioned on. Consequently, when the trained reward model is evaluated on prompt-response pairs that lie outside the data distribution, neglecting the effect of prompts can result in poor generalization. To address this issue, we decompose the reward value into two independent components: a prompt-free reward and a prompt-related reward. The prompt-free reward represents the evaluation determined only by the response, while the prompt-related reward reflects the part of the reward that derives from both the prompt and the response. We extract these two components from an information-theoretic perspective, requiring no extra models. Subsequently, we propose a new reward learning algorithm that prioritizes data samples based on their prompt-free reward values. Through toy examples, we demonstrate that the extracted prompt-free and prompt-related rewards effectively characterize the two parts of the reward value. Furthermore, standard evaluations show that our method improves both the alignment performance and the generalization capability of the reward model.

Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning Large Language Models (LLMs) [8, 5]. Across the wide range of RLHF methods, reward learning plays a pivotal role. These methods typically first train a reward model on a static dataset and then use it for Reinforcement Learning (RL) [28, 11]. Compared with methods that dispense with reward models [30, 40], their advantage is the ability to leverage the reward model's generalization capability to evaluate out-of-distribution prompt-response pairs. These prompt-response pairs, together with the generated rewards, can then be used to further improve the LLM's performance [36, 42].

Clearly, learning a generalizable reward model is central to this scenario. However, we find that standard reward training does not guarantee sufficient generalization capability. In reward model training, the primary objective is typically to better distinguish between chosen and rejected responses. To achieve this, the reward model does not necessarily need to consider the corresponding prompt. Take reward learning based on the Bradley-Terry (BT) model as an example. Since the space of possible responses is vastly larger than the dataset, different data samples typically contain distinct response pairs. As long as the reward gap within each response pair increases, the training loss decreases effectively. This can happen even if the reward model considers only the responses and entirely ignores the prompts. In that case, the trained reward model loses its generalization capability across prompts and may exhibit incorrect preferences for novel prompt-response pairs.
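To make this failure mode concrete, below is a minimal sketch of the standard BT reward loss (PyTorch-style, with a hypothetical `reward_model(prompts, responses)` interface): the loss depends only on the per-pair reward gap, so a model that scores responses while ignoring the prompt can still fit the dataset as long as response pairs do not repeat across samples.

```python
import torch
import torch.nn.functional as F

def bt_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry reward loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    Note: the loss only depends on the reward gap within each pair, so it
    can be driven down even by a model that ignores the prompt x entirely.
    """
    r_chosen = reward_model(prompts, chosen)      # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```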

Perhaps surprisingly, this phenomenon indeed appears in current reward models, even in some that achieve state-of-the-art performance on common benchmarks. As shown in Fig. 1 (left), after replacing the corresponding prompt with other prompts from the dataset, the reward gaps still center around their original values. This issue, where the responses dominate the reward gap, does not hinder training but leads to catastrophic results when evaluating novel prompt-response pairs, as the illustrative example in Fig. 1 (right) shows. When each prompt-response pair in the training dataset is considered separately, its reward gap matches the ideal value. However, when preferences are queried after replacing the original prompt with other prompts from the dataset (which are also meaningful queries), the reward model can yield inaccurate or even reversed preferences. This generalization issue becomes more pronounced for unseen prompt-response pairs encountered during evaluation. All of this highlights the need to distinguish two components of the reward value: the part determined solely by the response, and the part that can only be derived by considering the prompt and the response jointly. We refer to the former as the prompt-free reward and to the latter as the prompt-related reward.
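The prompt-replacement check behind Fig. 1 (left) can be reproduced with a simple probe; the sketch below (hypothetical names, same `reward_model` interface as above) shuffles prompts across the batch and compares the resulting gap distributions. If the shuffled gaps stay close to the original ones, the responses are dominating the reward.

```python
import torch

@torch.no_grad()
def prompt_replacement_probe(reward_model, prompts, chosen, rejected):
    """Compare reward gaps under the original vs. shuffled prompts.

    Returns both gap tensors; largely overlapping distributions indicate
    that the reward model is effectively ignoring the prompt.
    """
    original_gap = reward_model(prompts, chosen) - reward_model(prompts, rejected)
    perm = torch.randperm(len(prompts)).tolist()
    shuffled = [prompts[i] for i in perm]
    shuffled_gap = reward_model(shuffled, chosen) - reward_model(shuffled, rejected)
    return original_gap, shuffled_gap
```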

To address this, we propose a novel decomposition that extracts these two components from an information-theoretic perspective, without requiring extra models. We then use the extracted prompt-free reward to guide the reward learning process, prioritizing training samples based on their prompt-free reward gaps. We verify our method through several toy examples and through standard evaluations on commonly used datasets and base models. In the toy examples, the extracted prompt-free reward gaps reflect the reward model's preference bias toward response-only features, while the prompt-related reward gaps capture its generalizable preference information. Moreover, in standard experiments on common datasets, the reward model trained with our method outperforms strong baselines. These experiments show that accounting for both prompt-free and prompt-related rewards during training enhances the alignment performance and generalization capability of the reward model.
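As a rough illustration of how such a prioritization could be wired into BT training (this is not the paper's information-theoretic extraction: here the prompt-free reward is crudely approximated by scoring responses against an empty prompt, and the weighting scheme is a hypothetical choice):

```python
import torch
import torch.nn.functional as F

def prioritized_bt_loss(reward_model, prompts, chosen, rejected, empty_prompt=""):
    """Illustrative sketch: approximate the prompt-free reward by scoring the
    responses with the prompt removed, then down-weight pairs whose preference
    is already explained by the responses alone."""
    # Full rewards conditioned on the real prompts.
    r_chosen = reward_model(prompts, chosen)
    r_rejected = reward_model(prompts, rejected)
    # Crude prompt-free proxy: same responses, empty prompt (an assumption,
    # not the paper's extraction procedure).
    empty = [empty_prompt] * len(chosen)
    with torch.no_grad():
        pf_gap = reward_model(empty, chosen) - reward_model(empty, rejected)
    # Larger weight for pairs whose prompt-free gap is small or negative,
    # i.e. pairs where the prompt must carry the preference signal.
    weights = torch.sigmoid(-pf_gap)
    return -(weights * F.logsigmoid(r_chosen - r_rejected)).mean()
```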