Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

Paper · arXiv 2407.19594 · Published July 28, 2024
Tags: Self Refinement · Self Consistency · Feedback Evolution · Reinforcement Learning · Reward Models

Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024c) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model’s ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.

Among the potential solutions to this challenge, self-judging by the AI emerges as a particularly promising approach. Yuan et al. (2024c) introduce an iterative Self-Rewarding mechanism that enables an LLM to improve autonomously. The process involves a single model that takes on two distinct roles: actor and judge. As an actor, the model produces responses aimed at fulfilling specific instructions. As a judge (itself a special kind of acting), the model evaluates these responses via LLM-as-a-Judge prompting (Zheng et al., 2024) and assigns rewards. The objective of the actor during this self-play is to maximize its reward, thereby improving its ability to follow instructions.
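The sketch below illustrates these two roles in Python. It is a minimal illustration, not the paper's implementation: `generate` is a hypothetical callable that samples one completion from the same LLM, and the judge template and score format are simplified stand-ins for the actual LLM-as-a-Judge prompt.

```python
# Minimal sketch of the actor and judge roles in Self-Rewarding (Yuan et al., 2024c).
# `generate(prompt) -> str` is a hypothetical helper that samples from the same LLM.
import re
from typing import Callable, List

JUDGE_TEMPLATE = (
    "Review the user's question and the response below, then score the response "
    "on a 0-5 additive rubric.\n\n"
    "Question: {prompt}\n\nResponse: {response}\n\n"
    "After your reasoning, end with the line 'Score: <0-5>'."
)

def act(generate: Callable[[str], str], prompt: str, num_candidates: int = 4) -> List[str]:
    """Actor role: sample several candidate responses for one instruction."""
    return [generate(prompt) for _ in range(num_candidates)]

def judge(generate: Callable[[str], str], prompt: str, response: str,
          num_judgements: int = 3) -> List[float]:
    """Judge role: score one response several times via LLM-as-a-Judge prompting."""
    scores = []
    for _ in range(num_judgements):
        verdict = generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
        match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", verdict)
        if match:
            scores.append(float(match.group(1)))
    return scores
```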

We hypothesize that a major limitation of this previous work is that its learning objective enhances the model's ability as an actor to generate better responses, while overlooking the model's ability as a judge. If the ability to judge does not improve, then training the actor over iterations can quickly saturate, or worse, overfit the reward signal (i.e., reward hacking). Consequently, it is imperative to improve the model's capabilities as a judge in addition to its ability to act.

In this paper, we propose a novel method called Meta-Rewarding, which assigns rewards to the model's own judgements in order to train its ability to judge. The key idea is to introduce a third role of meta-judge, whose task is to evaluate the model's own judgements. While the judge evaluates the actor's responses, the meta-judge evaluates the judge's judgements (including the rewards it assigns) using a mechanism similar to LLM-as-a-Judge, which we term LLM-as-a-Meta-Judge.

In our method, we assume a setup where we only have an initial seed model, an instruction-tuned LLM, and no further human supervised training data. The idea is to generate training data from the model itself through an iterative self-play process. In this process, the model assumes three main roles: as an actor, it generates responses to given prompts; as a judge, it evaluates and scores its own responses; and as a meta-judge, it compares the quality of its own judgments.
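To make the self-play loop concrete, here is a sketch of how actor preference pairs might be assembled in one iteration, reusing the hypothetical `act` and `judge` helpers from the earlier sketch. The pairing rule (best- vs. worst-scored candidate) is a simplification for illustration; the meta-judge role, used to build preference pairs for judge training, is sketched further below.

```python
# Sketch of actor preference-pair construction for one Meta-Rewarding iteration.
# Assumes the hypothetical `act` and `judge` helpers defined above.
from statistics import mean

def build_actor_pairs(generate, prompts):
    """For each prompt, pair the best-scored response (chosen) with the worst (rejected)."""
    pairs = []
    for prompt in prompts:
        candidates = act(generate, prompt)
        scored = [(mean(judge(generate, prompt, resp) or [0.0]), resp)
                  for resp in candidates]
        scored.sort(key=lambda pair: pair[0])
        worst, best = scored[0][1], scored[-1][1]
        if best != worst:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```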

While training the actor to generate better responses to user queries is the final objective, this training’s efficacy relies on the accuracy of the judge. As the judge’s accuracy increases, it will provide higher quality feedback for training the actor, ultimately leading to a better actor. Therefore, the goal of Meta-Rewarding is to improve the model’s capability both as actor and judge during training. The role of the meta-judge is to provide feedback necessary for training the judge.

Once we have preference data for both the actor and the judge, we apply preference optimization to the dataset via DPO (Rafailov et al., 2024). Note that while other RLHF methods could be employed, we chose DPO for its simplicity and stability.
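For reference, a minimal sketch of the DPO objective is shown below, assuming per-sequence log-probabilities have already been computed under the trained policy and a frozen reference model for each chosen/rejected pair; this reflects the standard DPO loss rather than any paper-specific training details.

```python
# Standard DPO preference loss (Rafailov et al., 2024), given per-sequence log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * ((chosen policy margin) - (rejected policy margin)))."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```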

To prepare effective training data for the judge, we focus on responses where the judge is the least certain, as measured by the variance of the scores it has given.
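A minimal sketch of this selection step follows, assuming each response has already been scored several times by the judge; higher score variance is taken as lower judge certainty.

```python
# Sketch: pick the responses on which the judge is least certain (highest score variance).
from statistics import pvariance

def select_uncertain_responses(scored_responses, top_k: int = 1):
    """scored_responses: list of (response_text, [score, score, ...]) tuples."""
    ranked = sorted(
        scored_responses,
        key=lambda item: pvariance(item[1]) if len(item[1]) > 1 else 0.0,
        reverse=True,
    )
    return [resp for resp, _ in ranked[:top_k]]
```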

The LLM-as-a-Meta-Judge prompt includes the original prompt x, the response y, and two of its judgements (j_m, j_n), as well as the rubric used by the judge. The model is then asked to generate chain-of-thought reasoning followed by its choice of the better judgement. Again, this uses the same LLM, this time acting as a meta-judge.
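A sketch of such a meta-judge prompt and role is given below; the template wording, verdict format, and parsing are illustrative assumptions rather than the paper's exact prompt (which, among other things, also handles positional bias by swapping the two judgements).

```python
# Sketch of an LLM-as-a-Meta-Judge comparison between two judgements of one response.
META_JUDGE_TEMPLATE = (
    "You are given a user question, a response, the scoring rubric, and two "
    "judgements of that response. Decide which judgement evaluates the response "
    "more accurately and fairly.\n\n"
    "Question:\n{prompt}\n\nResponse:\n{response}\n\n"
    "Rubric:\n{rubric}\n\n"
    "Judgement A:\n{judgement_a}\n\nJudgement B:\n{judgement_b}\n\n"
    "Think step by step, then end with the line 'Verdict: A' or 'Verdict: B'."
)

def meta_judge(generate, prompt, response, rubric, judgement_a, judgement_b):
    """Meta-judge role: pick the better of two judgements of the same response."""
    verdict = generate(META_JUDGE_TEMPLATE.format(
        prompt=prompt, response=response, rubric=rubric,
        judgement_a=judgement_a, judgement_b=judgement_b))
    return "A" if verdict.strip().endswith("A") else "B"
```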

However, this can lead to a length explosion, where responses get longer with each iteration. This is due to the length bias of the judge, a well-known issue in reward models (Dubois et al., 2024a; Park et al., 2024; Yuan et al., 2024b). To mitigate this, we introduce a simple length-control mechanism.
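One simple length-control heuristic is sketched below: among candidates whose score is within a small margin rho of the best score, prefer the shortest as the chosen response. This is an illustrative assumption and may differ in detail from the paper's exact mechanism.

```python
# Sketch of a length-control heuristic when picking the "chosen" response.
def length_controlled_choice(candidates, rho: float = 0.1):
    """candidates: list of (response_text, mean_judge_score) tuples."""
    best_score = max(score for _, score in candidates)
    near_best = [resp for resp, score in candidates if score >= best_score - rho]
    return min(near_best, key=len)  # shortest response among the near-best ones
```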