Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
Large Language Models (LLMs) are widely used as automated judges, where practical value depends not only on accuracy but also on trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of well-calibrated confidence, which is vital for adaptive and reliable evaluation pipelines. In this work, we advocate a shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems, emphasizing well-calibrated confidence as a prerequisite for trustworthy and adaptive evaluation. We systematically identify the Overconfidence Phenomenon in current LLM-as-a-Judge systems, where predicted confidence significantly overstates actual correctness, undermining reliability in practical deployment. To quantify this phenomenon, we introduce TH-Score, a novel metric measuring confidence-accuracy alignment. Furthermore, we propose LLM-as-a-Fuser, an ensemble framework that transforms LLMs into reliable, risk-aware evaluators. Extensive experiments demonstrate that our approach substantially improves calibration and enables adaptive, confidence-driven evaluation pipelines, achieving superior reliability and accuracy compared to existing baselines.
The widespread adoption of large language models (LLMs) as automated judges—termed the LLM-as-a-Judge paradigm—has revolutionized the evaluation of AI-generated content by offering scalability and efficiency over traditional human annotation (Zheng et al. 2023). In this paradigm, LLMs act as evaluators, with one common application being pairwise comparison, where the model decides which of two text segments is better according to criteria such as quality, relevance, or coherence. However, the practical value of these systems depends not only on accuracy but also on trustworthy, risk-aware judgments that can adapt to real-world deployment scenarios. Existing approaches, such as FairEval (Wang et al. 2023a) and JudgeBench (Tan et al. 2024), predominantly emphasize accuracy, often overlooking the critical role of well-calibrated confidence. Calibration, defined as the alignment between a model's predicted confidence and its actual correctness, is essential for building adaptive evaluation pipelines: well-calibrated confidence allows high-confidence outputs to be accepted automatically, minimizing manual intervention, while low-confidence cases can be flagged for human review (Li et al. 2024). In this work, we advocate a fundamental shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems, prioritizing calibration to ensure reliable and trustworthy assessments.
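As a concrete illustration of such a confidence-driven pipeline, the minimal sketch below routes a judge's pairwise verdict by its stated confidence. The function name, routing labels, and thresholds are hypothetical choices for exposition, not part of any specific system; in practice the thresholds would be tuned on held-out data once the judge's confidence is calibrated.

```python
def route_judgment(verdict: str, confidence: float,
                   accept_threshold: float = 0.9,
                   review_threshold: float = 0.5) -> tuple[str, str]:
    """Route a judge's verdict based on its stated confidence.

    Thresholds are illustrative assumptions; a calibrated judge makes
    the high-confidence auto-accept path safe to use.
    """
    if confidence >= accept_threshold:
        return ("auto_accept", verdict)    # trust the judge outright
    if confidence < review_threshold:
        return ("human_review", verdict)   # escalate uncertain cases
    return ("secondary_check", verdict)    # e.g., re-judge or ensemble


print(route_judgment("Response A", 0.95))  # ('auto_accept', 'Response A')
print(route_judgment("Response B", 0.40))  # ('human_review', 'Response B')
```

The value of this routing scheme hinges entirely on calibration: if confidence overstates correctness, the auto-accept branch silently admits errors.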
Despite these potential benefits, current LLM-as-a-Judge systems suffer from a pervasive Overconfidence Phenomenon, where predicted confidence levels significantly overstate actual correctness (Mielke et al. 2022; Zhou, Jurafsky, and Hashimoto 2023), thereby undermining reliability in practical applications. Through systematic analysis, we observe that state-of-the-art LLMs exhibit this issue prominently, leading to inflated confidence scores that do not reflect true performance (Zhao et al. 2021). This misalignment results in substantial risks: overconfident models may propagate erroneous judgments without detection, eroding the efficiency gains of automated evaluation, while also complicating downstream decision-making in pipelines (Gu et al. 2024). Furthermore, existing benchmarks and metrics exacerbate the problem by focusing on aggregate accuracy without addressing confidence alignment, introducing biases such as response length or model familiarity that distort calibration assessments (Chen et al. 2024; Zheng et al. 2023; Wang et al. 2023a). Consequently, the lack of calibration-aware tools limits the deployment of LLMs as dependable evaluators in high-stakes environments.
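One standard way to surface this misalignment is a reliability analysis such as Expected Calibration Error (ECE), discussed below. The sketch here bins judgments by stated confidence and measures the coverage-weighted gap between confidence and accuracy; the Overconfidence Phenomenon appears as bins, especially the top ones, where mean confidence exceeds bin accuracy.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Standard ECE: coverage-weighted mean |accuracy - confidence| per bin.

    `confidences` are the judge's stated confidences in [0, 1];
    `correct` are 0/1 indicators of whether each judgment was right.
    """
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # first bin is closed on the left so confidence 0.0 is counted
        mask = (conf >= lo) & (conf <= hi) if lo == 0.0 else (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(hit[mask].mean() - conf[mask].mean())
    return float(ece)
```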
To address these challenges, we introduce TH-Score, a novel metric that quantifies confidence-accuracy alignment by focusing on the critical high- and low-confidence intervals on which practical decisions hinge. Unlike traditional metrics such as accuracy or ECE, which either ignore confidence or overlook these key thresholds, TH-Score balances accuracy within these intervals against their coverage, rewarding aligned outcomes (e.g., high-confidence correct predictions) while penalizing mismatches such as overconfident errors. This makes TH-Score a principled tool for detecting the Overconfidence Phenomenon in LLM-as-a-Judge scenarios, highlighting cases where high confidence fails to match actual correctness.
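As a rough illustration of this interval-focused idea only, the following simplified sketch is a hypothetical instantiation, not TH-Score's actual definition: it rewards high-confidence correct judgments and low-confidence incorrect ones (where low confidence is the aligned outcome), each weighted by interval coverage. The thresholds `tau_low` and `tau_high` are assumptions.

```python
import numpy as np

def interval_alignment_sketch(confidences, correct,
                              tau_low: float = 0.3,
                              tau_high: float = 0.7) -> float:
    """Illustrative interval-focused alignment score (NOT the paper's TH-Score).

    Mirrors the stated intuition: coverage-weighted accuracy in the
    high-confidence interval, plus coverage-weighted error rate in the
    low-confidence interval, where being wrong at low confidence counts
    as aligned self-assessment.
    """
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=bool)
    n = len(conf)

    high = conf >= tau_high
    low = conf <= tau_low

    score = 0.0
    if high.any():
        score += (high.sum() / n) * hit[high].mean()       # aligned successes
    if low.any():
        score += (low.sum() / n) * (~hit[low]).mean()      # aligned uncertainty
    return float(score)
```

Under this reading, an overconfident judge scores poorly even at high aggregate accuracy, because its high-confidence interval is crowded with errors while its low-confidence interval covers few cases.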