Evaluating the False Trust Engendered by LLM Explanations

Paper · arXiv 2605.10930

Large Language Models (LLMs) and Large Reasoning Models (LRMs) are increasingly used for critical tasks, yet they provide no guarantees about the correctness of their solutions. Users must decide whether to trust the model’s answer, aided by reasoning traces, their summaries, or post-hoc generated explanations. These reasoning traces are often interpreted as provenance explanations, despite evidence that they are neither faithful representations of the model’s computations nor necessarily semantically meaningful. It is unclear whether explanations or reasoning traces help users identify when the AI is incorrect, or whether they simply persuade users to trust the AI regardless. In this paper, we take a user-centered approach and develop an evaluation protocol to study how different explanation types affect users’ ability to judge the correctness of AI-generated answers and how much false trust they engender. We conduct a between-subjects user study simulating a setting where users do not have the means to verify the solution, and analyze the false trust engendered by commonly used LLM explanations: reasoning traces, their summaries, and post-hoc explanations. We also test a contrastive dual explanation setting where we present arguments for and against the AI’s answer. We find that reasoning traces and post-hoc explanations are persuasive but not informative: they increase user acceptance of LLM predictions regardless of their correctness. In contrast, the dual explanation is the only condition that genuinely improves users’ ability to distinguish correct from incorrect AI outputs.

The advent of Large Language Models (LLMs) has revolutionized the field of AI. LLMs have demonstrated impressive capabilities across a wide array of tasks, from planning and reasoning to instruction following and story writing. Unlike traditional AI systems, these generative language models are broad, general-purpose models, which has led to their rapid adoption by lay users. However, this adoption comes with a caveat: LLMs confidently answer any question put to them, regardless of whether the answer is correct. As these systems are increasingly used for critical decisions in healthcare, education, and legal domains, two critical questions arise: how do we develop methods that engender appropriate rather than uncritical trust in end users, and how do we evaluate the effectiveness of such methods?

Our study reveals that reasoning traces and post-hoc explanations significantly increase the false trust engendered in users, compared to our baseline where no explanation is provided. In an attempt to mitigate this critical issue, we evaluate a post-hoc explanation generation paradigm, dual explanations, which is similar to contrastive explanation generation: the user is exposed to the merits and demerits of the LLM-generated answer. Overall, we find that dual explanations have the lowest rate of engendering false trust. We also observe that reasoning traces and dual explanations achieve the highest rates of appropriate trust, though for fundamentally different reasons. Reasoning traces tend to increase user confidence in LLM-generated answers overall, leading to high accuracy in identifying correct answers but poor performance in detecting incorrect ones. In contrast, dual explanations exhibit a more balanced effect: users maintain relatively high accuracy in identifying both correct and incorrect answers.
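To make the distinction between these quantities concrete, the following is a minimal sketch of how such rates could be computed from per-item participant judgments. The definitions used here are assumptions made for illustration, not the paper's stated metrics: false trust is taken as the share of incorrect AI answers that the user accepted, and appropriate trust as the share of all judgments where the user's decision matched the answer's actual correctness. The names `Judgment` and `trust_rates` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Judgment:
    """One participant decision about one AI answer (hypothetical record)."""
    ai_correct: bool     # ground truth: was the AI's answer correct?
    user_accepted: bool  # did the participant accept the AI's answer?


def _rate(hits: int, total: int) -> float:
    # Return NaN rather than raising when a subset is empty.
    return hits / total if total else float("nan")


def trust_rates(judgments: List[Judgment]) -> Dict[str, float]:
    """Compute illustrative trust metrics under the assumed definitions."""
    correct = [j for j in judgments if j.ai_correct]
    incorrect = [j for j in judgments if not j.ai_correct]

    accepted_correct = sum(j.user_accepted for j in correct)
    accepted_incorrect = sum(j.user_accepted for j in incorrect)
    rejected_incorrect = len(incorrect) - accepted_incorrect

    return {
        # Assumed: incorrect answers the user nevertheless accepted.
        "false_trust": _rate(accepted_incorrect, len(incorrect)),
        # Assumed: decisions that matched the answer's correctness.
        "appropriate_trust": _rate(
            accepted_correct + rejected_incorrect, len(judgments)
        ),
        # Accuracy split by whether the AI answer was correct.
        "accuracy_on_correct": _rate(accepted_correct, len(correct)),
        "accuracy_on_incorrect": _rate(rejected_incorrect, len(incorrect)),
    }


if __name__ == "__main__":
    # Toy data, not study results: a user who accepts almost everything.
    toy = [
        Judgment(ai_correct=True, user_accepted=True),
        Judgment(ai_correct=True, user_accepted=True),
        Judgment(ai_correct=False, user_accepted=True),
        Judgment(ai_correct=False, user_accepted=False),
    ]
    for name, value in trust_rates(toy).items():
        print(f"{name}: {value:.2f}")
```

Under these assumed definitions, a condition that simply raises acceptance across the board scores well on `accuracy_on_correct` but poorly on `accuracy_on_incorrect`, which is the pattern described above for reasoning traces. A balanced condition keeps both values relatively high.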

Post-hoc explanations can be overly persuasive and engender false trust in users: Post-hoc explanations arguing for the answer’s correctness also engender high false trust and critically hamper the user’s ability to distinguish between correct and incorrect answers. We speculate that this is because these explanations are generated by models heavily optimized through RLHF to produce responses that satisfy users. In [11], the authors identify satisfaction as a key property of explanations in human-AI interaction: the explanation should leave the user feeling that they understand the AI’s reasoning. LLM-generated explanations excel at this, as they are optimized to produce responses that are helpful, warm, and satisfying to users. Recent work has shown that such optimization can increase sycophancy, i.e., the tendency to produce responses that agree with or please the user rather than responses that are correct, and reduce the accuracy of the model. Models generating post-hoc explanations may be exhibiting similar tendencies.
