Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Paper · arXiv 2501.18099 · Published January 30, 2025
Tags: Reasoning by Reflection · Reasoning Critiques · Task Planning

LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer, and entirely synthetically generated, preference pairs.
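As a rough illustration of the decoupling described above, the judge's output can be thought of as three ordered parts: an evaluation plan, its execution, and a final verdict. The sketch below shows one such rendering; the section headers and verdict strings are illustrative assumptions, not the paper's exact prompt or output format.

```python
# A minimal sketch of the decoupled judgment output described in the abstract.
# Section headers and verdict labels are assumptions for illustration only.
def format_judgment(plan: str, execution: str, verdict: str) -> str:
    """Render a Thinking-LLM-as-a-Judge CoT as plan, then execution, then verdict."""
    return (
        "[Evaluation Plan]\n" + plan +            # instruction-specific steps for judging
        "\n\n[Plan Execution]\n" + execution +    # step-by-step reasoning over the responses
        "\n\n[Final Verdict]\n" + verdict         # e.g., "Response A" or "Response B"
    )
```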

As large language models (LLMs) continue to improve, reliably evaluating their long-form outputs has become even more challenging. Owing to the high cost of human evaluation, the LLM-as-a-Judge paradigm has emerged as a promising alternative, where LLMs themselves are employed as evaluators (Zheng et al., 2023; Kim et al., 2024a; Saha et al., 2024a; Dubois et al., 2024). LLM-as-a-Judge models also serve as reward models during training for iterative preference optimization and self-improvement (Yuan et al., 2024). Compared to traditional reward models that only output scalar scores, LLM-as-a-Judge models expend more test-time compute by generating Chain-of-Thought (CoT) rationales that articulate the underlying reasoning process behind an evaluation. This has been shown to not only improve evaluation accuracy but also enhance transparency (Zheng et al., 2023; Wang et al., 2024c; Ankner et al., 2024).

Despite the promise of LLM-as-a-Judge models, the lack of human-annotated CoTs makes it difficult to train such models. Hence, a crucial step in building these judges is generating rationales by writing down detailed evaluation instructions or rubrics that LLMs can follow. These hand-crafted instructions vary for every new domain (e.g., safety versus coding) (Yu et al., 2024b) and include manually designing evaluation criteria (Zheng et al., 2023; Saha et al., 2024a; Trivedi et al., 2024; Wang et al., 2024b,c), scoring rubrics, and steps for each criterion (Yuan et al., 2024; Trivedi et al., 2024; Kim et al., 2024b; Wang et al., 2024d). This is limiting because each task necessitates evaluation standards or procedures tailored to it. For instance, evaluating an essay requires measuring quality along multiple, potentially subjective, fine-grained criteria like relevance and clarity, whereas evaluating a math problem requires objectively verifying the correctness of the solution in a step-by-step manner (Lightman et al., 2024). Simply using predefined evaluation prompts hurts evaluation accuracy, while manually adjusting the evaluation instructions is neither scalable nor realistic, given the wide range of arbitrary and complex tasks that LLM-as-a-Judge models are used for.

To overcome these limitations, we propose EvalPlanner, a novel approach to building Thinking-LLM-as-a-Judge models that teaches LLMs to both plan and reason for evaluation. EvalPlanner is trained to perform complex evaluation by thinking, spending more test-time compute on CoTs that are decoupled into a planning component and a reasoning component. In the planning component, the model generates a detailed evaluation plan that consists of all the necessary steps to evaluate responses specific to the given instruction. In the reasoning component, the model executes the plan step-by-step and reasons through the input response(s) to arrive at the final verdict. EvalPlanner is iteratively trained in a self-improving loop (Yuan et al., 2024; Wang et al., 2024c; Wu et al., 2024a) by sampling multiple plans and plan executions from the current model and performing preference optimization over correct and incorrect CoTs, i.e., chosen and rejected (plan, execution, verdict) triples. This teaches the model to iteratively optimize for both (1) generating a good plan that may encapsulate the most relevant and fine-grained criteria, scoring rubrics, reference answers, unit tests, etc., based on the input task at hand, and (2) performing correct execution grounded in the generated plan. EvalPlanner achieves this learning using only synthetic data as supervision via self-training.
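To make this training recipe concrete, the sketch below shows one self-training iteration under the following assumptions: the judge samples several (plan, execution, verdict) CoTs per preference example, CoTs whose verdict matches the known preferred response are treated as chosen and the rest as rejected, and the resulting pairs feed a DPO-style preference optimization step. The helpers `sample_cot` and `preference_optimize` are hypothetical placeholders, not the authors' implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class CoT:
    plan: str        # unconstrained, instruction-specific evaluation plan
    execution: str   # step-by-step execution of the plan over the responses
    verdict: str     # final judgment, e.g. "A" or "B"

def self_training_iteration(judge_model, dataset, num_samples=8):
    """One EvalPlanner-style iteration (sketch): sample CoTs from the current
    judge, label them by verdict correctness, and build chosen/rejected pairs
    for preference optimization."""
    preference_pairs = []
    for example in dataset:
        # `example` carries an instruction, two candidate responses, and the
        # synthetically derived label of which response is preferred.
        cots = [
            sample_cot(judge_model, example.instruction,
                       example.response_a, example.response_b)   # hypothetical sampler
            for _ in range(num_samples)
        ]
        # A CoT counts as correct iff its final verdict matches the known preference.
        correct = [c for c in cots if c.verdict == example.preferred]
        incorrect = [c for c in cots if c.verdict != example.preferred]
        if correct and incorrect:
            preference_pairs.append(
                (example, random.choice(correct), random.choice(incorrect))
            )
    # DPO-style optimization over chosen vs. rejected (plan, execution, verdict) triples.
    return preference_optimize(judge_model, preference_pairs)   # hypothetical optimizer
```

Note that each iteration re-samples plans and executions from the most recent model, so both the quality of the generated plans and the faithfulness of their execution can improve over rounds.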

We conduct extensive experiments on four reward modeling benchmarks – RewardBench, RM-Bench, JudgeBench, and FollowBenchEval – spanning instructions across categories of Chat, Safety, Code, Math, and fine-grained multi-level constraints. On RewardBench, EvalPlanner achieves a new state-of-the-art score of 93.9 for generative reward models, outperforming baselines that train on up to 30x more, and typically human-annotated, data.