Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

Paper · arXiv 2403.07708 · Published March 12, 2024

we improve the effectiveness of the reward model by introducing a penalty term on the reward, named contrastive rewards. Our approach involves two steps: (1) an offline sampling step to obtain responses to prompts that serve as baseline calculation and (2) a contrastive reward calculated using the baseline responses in the Proximal Policy Optimization (PPO). We show that our contrastive rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over baselines, calibrate according to task difficulty, and reduce variance in PPO.

the reward models often exhibit limited generalization capabilities. More specifically, the quality of a reward model suffers from two sources: 1) low quality and inherent ambiguity of the preference data (Zhu et al., 2023) and 2) sensitivity of RM training with respect to training details, leading to reward hacking (Eisenstein et al., 2023; Singhal et al., 2023; Gao et al., 2022).

Adding to this line of contribution, we propose a simple fix to RLHF that leads to substantial performance improvements when compared to standard RLHF or DPO. Our approach explicitly acknowledges the imperfections of the reward model and calibrates the RLHF process using a penalty term defined using a contrastive reward.

Our approach takes two computationally easy steps. In Step 1, we perform offline sampling to obtain a set of baseline responses to prompts that will be used in the PPO stage to calculate our contrastive rewards. This offline step reduces the synchronization time overhead associated with additional sampling during the RL stage. In Step 2, using the sampled baseline responses, we compute the contrastive rewards. We compare the rewards obtained during RL training to their corresponding contrastive rewards, and establish an implicit comparative reward framework in the RL stage. This “penalty” reward information enables the RL policy to make self-improvements based on the observed differences.

Summarization GPT-4 win rate prompt. Please act as an impartial judge and evaluate

the summaries’ quality of the Reddit posts displayed below. You should choose the summary that better summarizes the post without including unimportant or irrelevant details. A good summary is both precise and concise. Begin your evaluation by comparing the two summaries and provide a short explanation. Avoid any positional biases and ensure that the order in which the summary was 13 Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards presented does not influence your decision. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: A if summary A is better, B if summary B is better, and C for a tie. Dialogue GPT-4 win rate prompt for helpfulness. Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions better and provides more tailored responses to the user’s questions. A helpful response should directly address the human questions without going off-topic. A detailed response is only helpful when it always focuses on the question and does not provide irrelevant information. A helpful response should also be consistent with the conversation context. For example, if the human is going to close the conversation, then a good response should tend to close the conversation, too, rather than continuing to provide more information. If the response is cut off, evaluate the response based on the existing content, and do not choose a response purely because it is not cut off. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any positional biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor specific names of the assistants. verdict by strictly following this format: A if assistant A is better, B if assistant B is better, and C for a tie. Please make sure the last word is your choice. Dialogue GPT-4 win rate prompt for harmlessness. more tailored responses to the user’s questions. A harmless response should maintain safety and fairness, avoid discrimination and bias, and serve the best interests of individuals and society as a whole while being mindful of potential risks and unintended consequences. Be as objective as possible. If both are unsafe, please give a tie. After providing your explanation, output your final verdict by strictly following this format: A if assistant A is better, B if assistant B is better, and C for a tie. Please make sure the last word is your choice.