Humans or LLMs as the Judge? A Study on Judgement Biases

Paper · arXiv 2402.10669 · Published February 16, 2024

Adopting humans and large language models (LLMs) as judges (a.k.a. human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from humans and LLMs, calling into question the reliability of the evaluation results. In this paper, we propose a novel framework that is free from referencing groundtruth annotations for investigating Misinformation Oversight Bias, Gender Bias, Authority Bias, and Beauty Bias on LLM and human judges. We curate a dataset referring to the revised Bloom’s Taxonomy and conduct thousands of evaluations. Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even the cutting-edge judges possess considerable biases. We further exploit these biases to conduct attacks on LLM judges. We hope that our work can alert the community to the bias and vulnerability of human- and LLM-as-a-judge, as well as the urgency of developing robust evaluation systems.

Therefore, an important question arises:

How biased are humans and LLMs on judging open-ended generation?

Current bias evaluation frameworks necessitate a golden standard, either in the form of groundtruth (e.g., correct vs. erroneous, harmful vs. non-harmful) or humans providing reference answers. But what if we intend to probe the effect of perturbations for which the golden standards are not provided or not well defined?

In this paper, we first identify the four biases of interest: Misinformation Oversight Bias, Gender Bias, Authority Bias, and Beauty Bias, which are crucial in natural language generation (NLG) evaluation. Inspired by intervention studies, we investigate these biases by adding four perturbations (factual errors, gender-biased content, fake references, and rich content) to raw answers, respectively. To fill the gap in current research, we propose a novel reference-free framework for bias evaluation on human and LLM judges. We first form a control group and an experimental group: each sample in the former contains a pair of answers to the same question, while each answer pair in the latter consists of one answer from the former and a perturbed version of the other answer.
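As a concrete illustration of this setup, the sketch below builds the two groups from raw answer pairs. The data layout, function names, and the toy fake-reference perturbation are illustrative assumptions, not the authors' released code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class AnswerPair:
    question: str
    answer_a: str   # kept unchanged in both groups
    answer_b: str   # perturbed only in the experimental group

def build_groups(raw_pairs: List[AnswerPair],
                 perturb: Callable[[str], str]) -> Tuple[List[AnswerPair], List[AnswerPair]]:
    """Form a control group (original answer pairs) and an experimental group
    in which answer_b is replaced by its perturbed version."""
    control = list(raw_pairs)
    experimental = [AnswerPair(p.question, p.answer_a, perturb(p.answer_b))
                    for p in raw_pairs]
    return control, experimental

# Toy perturbation targeting Authority Bias: append a fabricated citation
# without changing the substance of the answer.
def add_fake_reference(answer: str) -> str:
    return answer + "\n\nReference: Smith et al. (2021), Nature."
```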

• We identify four under-explored biases (Section 3). We propose a novel reference-free framework for bias analysis on human and LLM judges (Section 4).

• We find that human judges barely have Gender Bias, but possess significant Misinformation Oversight Bias and Beauty Bias.

• All LLM judges possess Misinformation Oversight Bias, Gender Bias, Authority Bias, and Beauty Bias to various extents (Section 5).

• One can easily exploit Authority Bias and Beauty Bias to conduct a prompt-based attack on LLM judges, achieving

Human judges are also found to have inherent bias (Zheng et al., 2023; Wu and Aji, 2023) and may not even provide reliable answers (Clark et al., 2021; Hämäläinen et al., 2023). As an alternative to humans, LLM judges are likewise found to have certain biases, and their annotation results require validation (Pangakis et al., 2023). Zeng et al. (2023) find that LLMs are prone to favor answers with superficially good quality. Positional bias (Wang et al., 2023a), cognitive bias (Koo et al., 2023), verbosity bias, and self-enhancement bias (Zheng et al., 2023) have also been identified. Our work quantifies another three biases that human and LLM judges may possess.

Attacks on LLM-as-a-judge are relatively under-explored. Recent works (Raina et al., 2024; Shi et al., 2024) propose optimization-based methods to hack LLM-as-a-judge. Our work, instead, provides a simple yet effective zero-shot prompt-based approach to deceive LLM judges.
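A minimal sketch of what such a zero-shot prompt-based attack could look like, assuming the attacker controls one of the two candidate answers. The decorations (a fabricated citation and superficial formatting) mirror the Authority and Beauty Bias perturbations studied here, but the exact prompt and decoration choices are illustrative assumptions.

```python
def decorate_answer(answer: str) -> str:
    """Add authority cues (a fabricated citation) and beauty cues (a header and
    a block quote) without changing the answer's actual content."""
    fake_citation = "Johnson & Lee (2019), Journal of Evaluation."  # deliberately fabricated
    return "## Answer\n\n" + answer + "\n\n> Supporting reference: " + fake_citation

JUDGE_TEMPLATE = (
    "You are an impartial judge. Read the question and the two candidate answers, "
    "then reply with 'A' or 'B' to indicate the better answer.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)

attack_prompt = JUDGE_TEMPLATE.format(
    question="Explain why the sky is blue.",
    answer_a="Rayleigh scattering makes shorter (blue) wavelengths scatter more strongly.",
    answer_b=decorate_answer("Blue light is scattered more by air molecules than red light."),
)
```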

Semantic-agnostic Bias. Semantic-agnostic bias refers to the bias of evaluators that is influenced by factors unrelated to the semantic content of the text. Common examples include authority bias and beauty bias.

3.2 Biases of Interest

In this study, we conduct extensive experiments to explore the four types of bias described below.

Bias 1. Misinformation Oversight Bias: this refers to the tendency to overlook factual errors in an argument. It often occurs when individuals carelessly draw conclusions without scrutinizing their supporting arguments.

Bias 2. Gender Bias: this refers to a judge's obliviousness to gender-biased content. It happens when a human or a model has not learned to avoid this unconscious bias.

Bias 3. Authority Bias: this is the tendency to attribute greater credibility to statements from perceived authorities, regardless of the actual evidence (Saffran et al., 2020). It often leads to an uncritical acceptance of expert opinions, which should not happen with careful readers or judges.

Bias 4. Beauty Bias: or “lookism”, means that someone is privileged because of their good looks. In our context, it refers to the inclination of judges to prefer visually appealing content.

We first identify the challenges of conducting bias analysis. First, when there is no groundtruth, or when humans fail to serve as a golden standard, a valid comparison of biases is hard to carry out. Second, it is hard to ensure that an experiment is both controlled and comprehensive; either a carelessly massive experiment or a naive setting would undermine the validity of the conclusions. Unfortunately, these challenges have not been overcome. First, groundtruth annotations (e.g., with or without factual errors) are indispensable in current bias analyses (Zeng et al., 2023; Wu and Aji, 2023), but the groundtruth may not be well defined in open-ended question answering. Second, existing experiment designs are either carelessly massive or too limited. Zheng et al. (2023) draw their conclusions from a massive dataset collected from crowd-sourced workers, which may introduce uncontrollable factors into the analysis. Wu and Aji (2023) conduct experiments on only 40 questions selected from Vicuna-80 (Chiang et al., 2023), resulting in conclusions with limited generalizability.

We adopt intervention as our research method to quantify the bias that judges possess. We investigate each bias by perturbing raw answers. We introduce factual errors and gender-biased content for testing Misinformation Oversight Bias and Gender Bias, respectively; a judge should be able to detect the flawed or gender-biased content. We introduce fake references and rich content for testing Authority Bias and Beauty Bias, respectively; an unbiased judge should stick to the semantics of the content when comparing answer pairs.
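One simple way to read off a bias from this intervention is to compare how often a judge prefers the (to-be-)perturbed answer before and after the perturbation; this metric and the judge interface below are assumptions for illustration, not necessarily the exact measure used in the paper, and the sketch reuses the `AnswerPair` groups from the earlier snippet.

```python
from typing import Callable, List

# `judge(question, answer_a, answer_b)` is assumed to return "A" or "B".
Judge = Callable[[str, str, str], str]

def win_rate_b(judge: Judge, pairs: List[AnswerPair]) -> float:
    """Fraction of pairs in which the judge prefers answer_b."""
    wins = sum(judge(p.question, p.answer_a, p.answer_b) == "B" for p in pairs)
    return wins / len(pairs)

def preference_shift(judge: Judge,
                     control: List[AnswerPair],
                     experimental: List[AnswerPair]) -> float:
    """Change in answer_b's win rate caused by the perturbation.
    For semantic-agnostic perturbations (fake references, rich content), a large
    positive shift suggests Authority or Beauty Bias; for detectable flaws
    (factual errors, gender-biased content), a near-zero shift suggests
    Misinformation Oversight or Gender Bias."""
    return win_rate_b(judge, experimental) - win_rate_b(judge, control)
```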