Evaluating Large Language Models at Evaluating Instruction Following
As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluation for comparing the ever-increasing list of models. This paper investigates the efficacy of these “LLM evaluators”, particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to the given instruction. We introduce a challenging meta-evaluation benchmark, LLMBAR, designed to test the ability of an LLM evaluator to discern instruction-following outputs. We manually curated 419 pairs of outputs, where one output follows the instruction and the other diverges from it yet may possess deceptive qualities that mislead an LLM evaluator, e.g., a more engaging tone. Contrary to findings on existing meta-evaluation benchmarks, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBAR, and even the highest-scoring ones have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators.
LLM evaluators are usually given one instruction and the corresponding outputs from two models, and asked to choose the preferred one. It remains an open question whether we can rely on these LLM evaluators and which ones to use. This highlights the need for a good meta-evaluation benchmark (consisting of instructions and output pairs associated with human judgments) so that we can measure to what extent different LLM evaluators agree with human preferences and choose evaluators in an informed manner.
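To make this setup concrete, the following Python snippet is a minimal sketch of a pairwise LLM evaluator and of the agreement metric a meta-evaluation benchmark would compute; it is not the paper's implementation. The `query_llm` callable and the prompt template are hypothetical placeholders standing in for an actual LLM API and prompt design.

```python
# Minimal sketch of pairwise LLM evaluation and meta-evaluation.
# `query_llm` is a hypothetical placeholder for a real LLM API call;
# the prompt template below is illustrative, not the paper's prompt.

from typing import Callable, List, Tuple

PROMPT_TEMPLATE = (
    "You are evaluating two responses to an instruction.\n"
    "Instruction: {instruction}\n"
    "Output (a): {output_a}\n"
    "Output (b): {output_b}\n"
    "Which output better follows the instruction? Answer with 'a' or 'b'."
)

def evaluate_pair(
    query_llm: Callable[[str], str],  # placeholder: prompt -> model text
    instruction: str,
    output_a: str,
    output_b: str,
) -> str:
    """Ask the LLM evaluator to pick the preferred output ('a' or 'b')."""
    prompt = PROMPT_TEMPLATE.format(
        instruction=instruction, output_a=output_a, output_b=output_b
    )
    answer = query_llm(prompt).strip().lower()
    return "a" if answer.startswith("a") else "b"

def agreement_rate(
    query_llm: Callable[[str], str],
    benchmark: List[Tuple[str, str, str, str]],
    # each entry: (instruction, output_a, output_b, human_label)
) -> float:
    """Fraction of pairs where the evaluator matches the human preference."""
    matches = sum(
        evaluate_pair(query_llm, ins, a, b) == gold
        for ins, a, b, gold in benchmark
    )
    return matches / len(benchmark)
```

A meta-evaluation benchmark in this sense is simply the `benchmark` list above paired with `agreement_rate`: the higher the agreement with human labels, the more trustworthy the evaluator (an LLM-and-prompt combination) is on that criterion.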
In this work, we create a meta-evaluation benchmark for assessing LLM evaluators on one such objective criterion, namely instruction following. We define it as the ability to correctly parse open-ended instructions and adhere to the specified requirements. This criterion relates to other desirable LLM properties, such as helpfulness (Askell et al., 2021). Furthermore, unlike attributes that can be easily acquired through imitation learning, such as engaging tones (Gudibande et al., 2023), even the strongest LLMs today struggle with following instructions (Wu et al., 2023b; Li et al., 2023c). Figure 1 (bottom) shows an example of instruction following vs. superficial quality. While the right output adheres to the instruction, both LLM evaluators and humans are often biased towards the left one due to its more engaging tone. If we do not rigorously analyze the capability of LLM evaluators to distinguish the true ability of instruction following from superficial cues, there is a risk of advancing models that excel at mimicking effective assistants rather than at executing the desired tasks.