The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs

Paper · arXiv 2508.08285 · Published August 1, 2025

Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Although numerous hallucination detection methods have been proposed, their evaluations often rely on ROUGE, a lexical-overlap metric that correlates poorly with human judgments. Through comprehensive human studies, we demonstrate that while ROUGE exhibits high recall, its extremely low precision leads to misleading performance estimates: several established detection methods show performance drops of up to 45.9% when assessed with human-aligned metrics such as LLM-as-Judge. Moreover, our analysis reveals that simple heuristics based on response length can rival complex detection techniques, exposing a fundamental flaw in current evaluation practices. We argue that adopting semantically aware, robust evaluation frameworks is essential for accurately gauging the true performance of hallucination detection methods and, ultimately, for ensuring the trustworthiness of LLM outputs.

To establish a human-aligned benchmark, we collect human judgments of factual correctness and compare metric outputs against these gold labels. We find that ROUGE exhibits alarmingly low precision for identifying actual factual errors. In contrast, an LLM-as-Judge approach (Zheng et al., 2023a) aligns far more closely with human assessments. Based on these insights, we re-evaluate existing detection methods under both ROUGE and human-aligned criteria, revealing dramatic performance drops (up to 45.9% for Perplexity and 30.4% for Eigenscore) when moving from ROUGE to LLM-as-Judge evaluation (see Figure 1).
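
To make the lexical-overlap failure mode concrete, the sketch below labels an answer correct whenever its ROUGE-L recall against the gold reference crosses a threshold, which is how ROUGE is commonly operationalized in QA evaluations. The use of the `rouge_score` package, the 0.3 threshold, and the example strings are our own illustrative assumptions, not the paper's exact protocol.

```python
# Illustrative sketch: ROUGE-L recall used as a correctness label for QA answers.
# The threshold and example strings are assumptions for illustration only.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_correct(reference: str, answer: str, threshold: float = 0.3) -> bool:
    """Label the answer 'correct' if its ROUGE-L recall against the gold
    reference exceeds the threshold -- a purely lexical-overlap criterion."""
    recall = scorer.score(reference, answer)["rougeL"].recall
    return recall >= threshold

reference = "Marie Curie won the Nobel Prize in Physics in 1903."
# Factually wrong, but re-uses most of the reference wording.
wrong_but_overlapping = "Marie Curie won the Nobel Prize in Physics in 1911."
# Factually right, but paraphrased with little word overlap.
right_but_paraphrased = "She received the 1903 physics Nobel."

print(rouge_correct(reference, wrong_but_overlapping))  # True: a false positive
print(rouge_correct(reference, right_but_paraphrased))  # False: a false negative
```

An LLM-as-Judge evaluation would instead prompt a strong model to compare the answer with the reference and return a correctness verdict, which is why it tracks human labels far more closely than the lexical check above.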

Finally, we uncover a surprising baseline: simple length-based heuristics (e.g., the mean and standard deviation of answer length) rival or exceed sophisticated detectors such as Semantic Entropy (a minimal sketch of this baseline follows below). Through controlled experiments that isolate length effects, we show that ROUGE can be manipulated through trivial repetition even when the factual content remains constant. Our findings expose a widespread overestimation of current methods and underscore the urgent need for more reliable, human-aligned evaluation metrics in QA hallucination detection.
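
The following sketch shows one way such a length-only baseline could look: sample several answers per question and score hallucination risk purely from the mean and standard deviation of answer length. The `generate` stub, the word-count tokenization, and the equal weighting of the two statistics are assumptions for illustration; the paper's exact feature construction may differ.

```python
# Illustrative sketch of a length-only hallucination baseline: the score is
# built solely from the mean and standard deviation of sampled answer lengths.
# The generate() callable and the weighting are assumptions, not the paper's setup.
import statistics
from typing import Callable

def length_heuristic_score(question: str,
                           generate: Callable[[str], str],
                           n_samples: int = 5) -> float:
    """Return a hallucination-risk score computed only from answer lengths."""
    lengths = [len(generate(question).split()) for _ in range(n_samples)]
    mean_len = statistics.mean(lengths)
    std_len = statistics.stdev(lengths) if len(lengths) > 1 else 0.0
    # Treat longer and more variable answers as riskier; the equal weighting
    # is arbitrary and only illustrates the idea.
    return mean_len + std_len
```

Like any other detector score, this quantity can be evaluated against correctness labels (e.g., via AUROC), which is how it ends up directly comparable to far more elaborate detection methods.

Our study makes the following key contributions: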