Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge

Paper · arXiv 2412.12509 · Published December 17, 2024
Tags: Evaluations · Reasoning · Critiques

Large Language Models (LLMs) have become increasingly powerful and ubiquitous, but their stochastic nature poses challenges to the reliability of their outputs. While deterministic settings can improve consistency, they do not guarantee reliability, as a single sample from the model’s probability distribution can still be misleading. Building upon the concept of LLM-as-a-judge, we introduce a novel framework for rigorously evaluating the reliability of LLM judgments, leveraging McDonald’s omega. We evaluate the reliability of LLMs when judging the outputs of other LLMs on standard single-turn and multi-turn benchmarks, simultaneously investigating the impact of temperature on reliability. By analyzing these results, we demonstrate the limitations of fixed randomness and the importance of considering multiple samples, which we show has significant implications for downstream applications. Our findings highlight the need for a nuanced understanding of LLM reliability and the potential risks associated with over-reliance on single-shot evaluations.

Equally important are reliability, confidence, and calibration; these three distinct concepts are intertwined, and all are essential for building genuine trust in LLM systems.

A perfectly calibrated LLM that says it’s 90% confident should be correct about 9 out of 10 times. However, even a perfectly calibrated LLM can be unreliable.
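The distinction can be made concrete with a small simulation (an illustrative setup, not the paper's experiment): a judge that claims 90% confidence and is correct 90% of the time on each independent re-query is perfectly calibrated, yet two re-queries of the same item agree with each other noticeably less than 90% of the time.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical judge: on every re-query it states 90% confidence and is
# correct with probability 0.9, independently each time (illustrative).
n_requeries = 1000
verdicts = rng.random(n_requeries) < 0.9   # True = correct verdict

# Calibration: empirical accuracy matches the stated 90% confidence.
accuracy = verdicts.mean()

# Reliability: probability that two independent re-queries of the same
# item return the same verdict, p^2 + (1 - p)^2.
self_agreement = accuracy**2 + (1 - accuracy) ** 2

print(round(accuracy, 2), round(self_agreement, 2))
```

With accuracy near 0.9, pairwise self-agreement is only about 0.82: the judge is calibrated yet measurably unreliable across replications.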

Given that a single output from an LLM represents only one draw from the model's distribution, the trustworthiness of any individual LLM output is of utmost importance.

Many research works aim to circumvent this issue by setting a fixed seed and using deterministic settings for the temperature and top-k parameters (Ouyang et al., 2023; Wei et al., 2024; Atil et al., 2024). These studies argue that if an LLM consistently produces the same output under these conditions, it can be considered reliable. However, consistent replication does not guarantee the reliability of the generated text.

Even with deterministic settings, a single LLM output remains a sample from the model’s probability distribution, subject to inherent randomness. This results in "fixed randomness," which can lead to significant limitations (Hellrich and Hahn, 2016).
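A toy next-token distribution makes the "fixed randomness" point concrete (the numbers here are illustrative, not from the paper): greedy decoding always returns the same token, but that token may carry well under half of the model's probability mass.

```python
import numpy as np

# Toy next-token distribution (illustrative): probability mass over tokens A, B, C.
probs = np.array([0.45, 0.40, 0.15])

# Deterministic settings (temperature -> 0) always pick the argmax token...
greedy = int(probs.argmax())

# ...but sampling reveals that token B is nearly as likely as token A, so the
# single "reliably replicated" greedy output misrepresents the distribution.
rng = np.random.default_rng(0)
samples = rng.choice(3, size=1000, p=probs)
counts = np.bincount(samples, minlength=3)

print(greedy, counts)
```

The greedy output replicates perfectly across runs, yet it is still just one draw frozen in place; the disagreement between draws is hidden, not removed.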

Our framework evaluates the reliability of the LLM-as-a-judge paradigm across diverse question formats and difficulty levels. Using LLM responses to benchmark questions, judge LLMs are prompted repeatedly to select the "best" response based on factors such as accuracy, utility, and relevance. The judge is prompted 100 times per item, varying only the replication while holding all other factors constant. McDonald's omega is then applied across these repeated judgments to quantify reliability.
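McDonald's omega can be estimated from a matrix of repeated judgment scores. The sketch below is a simplified one-factor approximation that derives loadings from the first principal component of the correlation matrix; the paper does not specify its exact estimation procedure, and dedicated factor-analysis software would give more precise loadings. The data here are synthetic.

```python
import numpy as np

def mcdonalds_omega(scores: np.ndarray) -> float:
    """Estimate McDonald's omega (total) for scores of shape
    (n_subjects, n_items), where items are repeated judgment replications.

    One-factor approximation: loadings come from the first principal
    component of the item correlation matrix (a common simplification).
    """
    corr = np.corrcoef(scores, rowvar=False)              # item correlations
    eigvals, eigvecs = np.linalg.eigh(corr)               # ascending order
    lam = np.sqrt(eigvals[-1]) * np.abs(eigvecs[:, -1])   # factor loadings
    unique_var = 1.0 - lam**2                             # unique variances
    return float(lam.sum() ** 2 / (lam.sum() ** 2 + unique_var.sum()))

# Synthetic example: 5 replications of a judgment that all track a common
# latent "true quality" signal plus independent noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 1))                 # latent true quality
scores = signal + 0.5 * rng.normal(size=(200, 5))  # 5 noisy replications

omega = mcdonalds_omega(scores)
print(round(omega, 3))  # values near 1 indicate highly consistent judgments
```

Omega near 1 indicates that the repeated judgments are dominated by a shared signal rather than replication noise, which is exactly the reliability property the framework probes.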
