Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection

Paper · arXiv 2403.09972 · Published March 15, 2024
Reasoning Methods · CoT · ToT · Reasoning by Reflection · Self Refinement · Self Consistency · Feedback · Alignment

Self-detection for Large Language Models (LLMs) seeks to evaluate the trustworthiness of the LLM's output by leveraging its own capabilities, thereby alleviating the issue of output hallucination. However, existing self-detection approaches only retrospectively evaluate answers generated by the LLM, typically leading to over-trust in incorrectly generated answers. To tackle this limitation, we propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers. It thoroughly compares the trustworthiness of multiple candidate answers to mitigate over-trust in LLM-generated incorrect answers. Building upon this paradigm, we introduce a two-step framework, which first instructs the LLM to reflect on and provide justifications for each candidate answer, and then aggregates the justifications for a comprehensive evaluation of the target answer. This framework can be seamlessly integrated with existing approaches for superior self-detection.

Previous studies on self-detection can be broadly categorized into two paradigms (cf. Figure 2). The first paradigm is confidence calibration, which aims to estimate the LLM's confidence in a generated answer so that it aligns with the actual answer accuracy, typically via multi-answer sampling and aggregation (Xiong et al., 2023; Tian et al., 2023b; Si et al., 2022; Jiang et al., 2023). The second is self-evaluation, which directly examines the compatibility of a question and an answer through various prompt strategies (Miao et al., 2023; Kadavath et al., 2022; Weng et al., 2023). These two paradigms have also been combined to enhance self-detection capabilities (Chen and Mueller, 2023; Ren et al., 2023a).
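To make the first paradigm concrete, below is a minimal sketch of confidence calibration via multi-answer sampling and aggregation, assuming only a black-box `llm(prompt, temperature)` call (the helper name and prompt wording are illustrative, not from the paper): the agreement rate among sampled answers serves as the confidence estimate for the target answer.

```python
def llm(prompt: str, temperature: float = 1.0) -> str:
    """Stand-in for any chat/completion API; replace with a real client."""
    raise NotImplementedError

def sampled_confidence(question: str, target_answer: str, k: int = 10) -> float:
    """Confidence calibration by sampling: the fraction of k sampled
    answers that agree with the target answer estimates the model's
    confidence in it."""
    samples = [llm(f"Q: {question}\nA:", temperature=1.0).strip()
               for _ in range(k)]
    return sum(s == target_answer for s in samples) / k
```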

However, both self-detection paradigms share a significant drawback: an inclination to over-trust the incorrect answers generated by the LLM (Si et al., 2022; Xiong et al., 2023; Jiang et al., 2023; Kadavath et al., 2022). We argue that one reason is that both paradigms evaluate only LLM-generated answers, while the LLM has an inherent bias towards trusting its own generations (Mielke et al., 2022; Lin et al., 2022a), leading to severe over-trust in its incorrect answers. An ideal self-detection paradigm should consider a more comprehensive answer space beyond the LLM's generations. By also evaluating other potentially correct answers in this broader space, the strong validity of those answers can counterbalance the excessive trust placed in incorrect LLM answers, thus alleviating the over-trust issue.

In this light, we introduce a new comprehensive answer evaluation paradigm that considers multiple candidate answers in the answer space to enhance self-detection (cf. Figure 2). This paradigm meticulously evaluates each candidate's trustworthiness as a correct answer to the question and aggregates these evaluations to improve self-detection of the target LLM answer. The biased trust in LLM-generated incorrect answers can be alleviated by comparing their trustworthiness with that of other, more trustworthy answers.
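As an illustration only (the paper's own construction of the answer space may differ), one hypothetical way to gather candidate answers beyond the LLM's generation, reusing the black-box `llm` helper from the sketch above:

```python
def gather_candidates(question: str, generated: str, k: int = 5) -> list[str]:
    """Build a candidate answer space beyond the model's own generation
    by asking the model for plausible alternatives."""
    alts = llm(
        f"Question: {question}\n"
        f"List {k} plausible, distinct answers, one per line."
    ).splitlines()
    pool = [generated] + [a.strip() for a in alts if a.strip()]
    seen: set[str] = set()
    candidates = []
    for a in pool:  # deduplicate while preserving order
        if a not in seen:
            seen.add(a)
            candidates.append(a)
    return candidates
```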

Our preliminary experiments confirm the efficacy of considering a more comprehensive set of answers to counter over-trust (cf. Section 2). In summary, two key considerations arise when instantiating this new paradigm: 1) resisting the inherent bias of the LLM in order to precisely evaluate the trustworthiness of each question-answer pair, and 2) aggregating these evaluations into the trustworthiness evaluation of the target answer.

To this end, we present a novel self-detection framework to tackle the over-trust issue of LLMs, named Think Twice before Trusting (T3) (cf. Figure 1). Our framework pushes the LLM to reflect on and justify each candidate answer's perspective before arriving at a trustworthiness judgment for the target answer. First, the LLM is instructed to generate justifications for the potential correctness of each candidate answer. Subsequently, a prompt-based method integrates these justifications into a joint evaluation of the target answer. Extensive experiments on six datasets across three tasks with three different LLMs show improved performance of T3 over methods from existing paradigms.
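A minimal sketch of the two steps, again with illustrative prompts and the `llm` helper from the first sketch (the exact prompt wording and aggregation format are assumptions, not the paper's verbatim design):

```python
def t3_self_detect(question: str, candidates: list[str], target: str) -> str:
    """Think Twice before Trusting (T3), sketched in two steps.
    Step 1: elicit a justification for each candidate answer.
    Step 2: aggregate all justifications to judge the target answer."""
    justifications = []
    for cand in candidates:
        j = llm(
            f"Question: {question}\n"
            f"Candidate answer: {cand}\n"
            "Explain why this answer could be correct."
        )
        justifications.append(f"Answer: {cand}\nJustification: {j}")
    joined = "\n\n".join(justifications)
    return llm(
        f"Question: {question}\n\n{joined}\n\n"
        f"Considering all justifications above, how trustworthy is the "
        f"answer '{target}'? Reply with a confidence score from 0 to 1."
    )
```

Because the final judgment is produced by a single prompt over the pooled justifications, this sketch can be layered on top of either existing paradigm, consistent with the framework's stated compatibility with prior approaches.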