Self-Taught Evaluators
Model-based evaluation is at the heart of successful model development – as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly, and the resulting data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions.
Such human preference judgments can be costly and time-consuming to collect, as they require expert annotation for challenging tasks (e.g., coding and mathematics). This dependency on human-generated data poses significant challenges for scaling to new tasks or evaluation criteria.
Given a seed model, our method first uses prompting to generate contrasting synthetic preference pairs for a given input, such that one response is designed to be inferior to the other. Next, using the model as an LLM-as-a-Judge, we generate reasoning traces and judgments for these pairs, which we can label as correct or not given our synthetic preference pair design. After training on this labeled data, we obtain a superior LLM-as-a-Judge, from which we can iterate the whole process so that it self-improves.
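To make this concrete, the following is a minimal sketch of the pair-construction step under stated assumptions: `generate(prompt) -> str` is a hypothetical call into the seed model, and the perturbation prompt is illustrative rather than the exact recipe used here.

```python
# Sketch of building one (chosen, rejected) pair from a single instruction.
# `generate` is a hypothetical seed-model call; the perturbation prompt is illustrative.

def build_preference_pair(instruction: str, generate) -> tuple:
    """Return a (chosen, rejected) response pair for `instruction`."""
    chosen = generate(instruction)
    # One way to obtain a likely-worse response: answer a deliberately perturbed
    # version of the instruction, so the result is plausible but off-target.
    perturbed = generate(
        "Write an instruction that is similar to, but subtly different from, "
        f"the following one:\n{instruction}"
    )
    rejected = generate(perturbed)
    return chosen, rejected
```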
Synthetic data has emerged as a promising solution for efficiently acquiring training examples and can be particularly valuable in settings where real-world data can be hard to access (e.g., weather data covering all conditions (Lam et al., 2023)) or where correct annotations can be challenging to acquire (e.g., coding tasks (Liu et al., 2024)).
To produce such judgments, it is common for the model to output, prior to the final judgment, a chain-of-thought (or “reasoning chain”): a set of steps generated in natural language that helps the model decide its final judgment.
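As an illustration, the snippet below shows an invented reasoning chain in this style together with a small parser for the final judgment; the "Verdict:" tag and the wording are assumptions, not the exact output format used here.

```python
import re
from typing import Optional

# Illustrative (invented) judge output: a short reasoning chain followed by a verdict.
example_trace = (
    "Response A addresses every part of the instruction and uses the correct formula.\n"
    "Response B omits the edge case the instruction explicitly asks about.\n"
    "Verdict: A"
)

def parse_verdict(trace: str) -> Optional[str]:
    """Extract the final 'A' or 'B' judgment from a reasoning chain, if present."""
    match = re.search(r"Verdict:\s*([AB])\s*$", trace.strip())
    return match.group(1) if match else None

assert parse_verdict(example_trace) == "A"
```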
We propose a novel recipe for training such an evaluator. Our overall method is an iterative training scheme that bootstraps improvements by annotating the current model’s judgments using constructed synthetic data – so that the Self-Taught Evaluator is more performant on the next iteration. Our overall pipeline is thus as follows:
• Initialization: We assume access to a large set of human-written user instructions, e.g., of the type that is commonly collected in production systems, and an initial seed LLM.
• Instruction Selection: We next select a challenging, balanced distribution of user instructions from the uncurated set by categorizing them via an LLM.
• Response Pair Construction: For each user instruction (example) we create a preference pair of two model responses (chosen & rejected), generating them via prompting such that the rejected response is likely of lower quality than the chosen response.
• Iterative Training: We then iterate the following two steps:
(i) Judgment Annotation: For each example, we sample up to N LLM-as-a-Judge reasoning traces and judgments from the current model. If we find a correct judgment (i.e., one that prefers the designed chosen response), we add that example to our training set; otherwise we discard it.
(ii) Model Fine-tuning: We fine-tune the model on the newly constructed training set, yielding an updated model for the next iteration (a code-level sketch of this loop follows below).
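A minimal sketch of one iteration of steps (i) and (ii) is given below; `build_pair`, `judge`, `parse_verdict`, and `finetune` are hypothetical helpers injected as callables, standing in for the components described above and a standard fine-tuning step, not the actual implementation.

```python
import random

def run_iteration(instructions, model, build_pair, judge, parse_verdict, finetune, n_samples=8):
    """One self-improvement iteration: annotate judgments, then fine-tune on them."""
    train_set = []
    for x in instructions:
        chosen, rejected = build_pair(x)
        # Randomize A/B order so a correct judgment cannot rely on position alone.
        if random.random() < 0.5:
            a, b, correct = chosen, rejected, "A"
        else:
            a, b, correct = rejected, chosen, "B"
        for _ in range(n_samples):  # (i) sample up to N reasoning traces + judgments
            trace = judge(model, x, a, b)
            if parse_verdict(trace) == correct:
                train_set.append({"instruction": x, "a": a, "b": b, "judgment": trace})
                break  # keep the first correct judgment; otherwise the example is discarded
    # (ii) fine-tune on the collected judgments to obtain the model for the next iteration.
    return finetune(model, train_set)
```

The rejection-sampling filter is what removes the need for human labels here: because the rejected response is constructed to be inferior, a judgment can be verified automatically, and only verifiably correct reasoning traces are used as training targets.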
Given a pool of human-written user instructions, there may be a large degree of noise, as well as an imbalance in terms of topic, variety, difficulty, and ability of the model to answer. We therefore aim to select a subset of instructions to generate high-quality synthetic responses and judgments that can be further used for training.
We use an LLM to classify each input into a given category, for example coding, reasoning, brainstorming, etc.
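One way such selection could look in code is sketched below; the category list and the `classify` helper (an LLM call returning a category label) are illustrative assumptions rather than the exact taxonomy used.

```python
from collections import defaultdict

# Hypothetical selection step: label each instruction with an LLM, then keep a capped,
# balanced sample per category.
CATEGORIES = ["coding", "reasoning", "brainstorming", "math", "writing", "other"]

def select_balanced(instructions, classify, per_category=500):
    """`classify(text) -> str` is an LLM call returning one of CATEGORIES."""
    buckets = defaultdict(list)
    for x in instructions:
        label = classify(x)
        if label in CATEGORIES:
            buckets[label].append(x)
    selected = []
    for label, items in buckets.items():
        selected.extend(items[:per_category])  # cap each category to balance the mix
    return selected
```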
LLM-as-a-Judge models differ from reward models that simply output a score, as an LLM-as-a-Judge typically first generates a reasoning chain. Further, we used relatively large LLMs in this work (70B parameters) and did not study whether our approach works on smaller models. Since we use a seed model to generate the first synthetic preferences during our iterative training scheme, one assumption is that this model is capable of generating reasonable evaluations; our approach is thus limited to settings with a capable instruction fine-tuned model that is already reasonably aligned to human (or legal/policy) preferences. Furthermore, we only investigated and reported metrics involving evaluation accuracy improvements, not computational requirements. We also only investigated pairwise evaluation, i.e., comparing two responses, whereas it is also possible to use LLM-as-a-Judge models (or any other model) to evaluate the quality of a single response, e.g., giving it a score out of 5 or 10, rather than a pairwise A-vs-B judgment.