RewardBench: Evaluating Reward Models for Language Modeling
To enhance scientific understanding of reward models, we present REWARDBENCH, a benchmark dataset and codebase for evaluation. The REWARDBENCH dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, used to benchmark how reward models perform on challenging, structured, and out-of-distribution queries. We create targeted comparison datasets for RMs in which one answer should be preferred over the other for subtle but verifiable reasons (e.g., bugs, incorrect facts). On the REWARDBENCH leaderboard, we evaluate reward models trained with a variety of methods, such as direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO).
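As background on the second method, DPO-trained models expose an implicit reward defined as the scaled log-probability ratio between the policy and a frozen reference model, r(x, y) = β·log(π(y|x)/π_ref(y|x)). A minimal sketch of scoring a preference pair this way is below; the log-probability values are hypothetical placeholders, whereas a real implementation would compute them with the two language models.

```python
# Sketch of DPO's implicit reward, assuming log-probabilities are
# available from the policy and the frozen reference model.
def dpo_implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    # r(x, y) = beta * log(pi(y|x) / pi_ref(y|x))
    #         = beta * (log pi(y|x) - log pi_ref(y|x))
    return beta * (logp_policy - logp_ref)

def prefers_chosen(
    logp_chosen: float, logp_ref_chosen: float,
    logp_rejected: float, logp_ref_rejected: float,
    beta: float = 0.1,
) -> bool:
    # The model counts as "correct" on a pair when the implicit reward
    # of the chosen completion exceeds that of the rejected one.
    return dpo_implicit_reward(logp_chosen, logp_ref_chosen, beta) > \
        dpo_implicit_reward(logp_rejected, logp_ref_rejected, beta)
```

Note that because β > 0 is a constant scale shared by both completions, the comparison reduces to comparing log-probability ratios directly.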
We curate data to create structured comparisons across a variety of reward model properties. Each sample is formatted as a prompt with a human-verified chosen and rejected completion. We design the subsets to vary in difficulty and coverage: some are solved by small RMs at 100% accuracy, while others remain hard to differentiate, with state-of-the-art performance around 75% and many models near the random baseline.
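Given this data format, the per-subset metric is the fraction of trios where the reward model scores the chosen completion above the rejected one. The sketch below illustrates that accuracy computation; the `score` function is a toy stand-in (response length), not the paper's actual reward models.

```python
# Sketch of RewardBench-style accuracy over prompt-chosen-rejected trios.
# `score` is a hypothetical placeholder; a real RM returns a learned scalar.
def score(prompt: str, response: str) -> float:
    return float(len(response))  # toy heuristic: longer = higher reward

def accuracy(trios):
    # A trio counts as correct when chosen outscores rejected.
    wins = sum(score(p, chosen) > score(p, rejected) for p, chosen, rejected in trios)
    return wins / len(trios)

trios = [
    ("What is 2+2?", "4, because 2+2=4.", "5"),                       # correct under the toy scorer
    ("Capital of France?", "Paris.", "It is definitely London, I believe."),  # wrong under the toy scorer
]
```

With a random scorer this metric sits near 50%, which is the random baseline the hardest subsets are compared against.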