Do explanations actually help users spot AI mistakes?
Most AI explanations are designed to justify the system's answer, but do they help users distinguish correct from incorrect outputs? This research tests whether standard explanation formats genuinely improve error detection or just increase trust regardless of accuracy.
Users of LLMs must decide whether to trust an answer, often aided by reasoning traces, summaries of those traces, or post-hoc explanations. The implicit assumption is that more explanation helps users judge correctness. A between-subjects user study, simulating settings where users cannot independently verify the solution, tests this assumption and finds it largely false. Reasoning traces and post-hoc explanations are persuasive but not informative: relative to a no-explanation baseline, they increase user acceptance of the model's prediction regardless of whether that prediction is correct. In other words, they engender false trust.
The one condition that breaks the pattern is contrastive dual explanation, where the user is shown arguments both for and against the AI's answer. Dual explanation has the lowest rate of engendering false trust and is the only condition that genuinely improves users' ability to distinguish correct from incorrect outputs. The contrast with reasoning traces is instructive. With traces, users judge correct answers accurately but rarely catch incorrect ones, because traces raise confidence uniformly. With dual explanations the effect is balanced: users stay accurate on both correct and incorrect cases.
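One way to make the difference between persuasive and informative concrete is to score each explanation condition separately on correct and incorrect AI answers. The sketch below is illustrative only: the metric names (appropriate_trust, false_trust, discrimination), the helper functions, and the toy counts are assumptions made for this example, not the study's definitions or data.

```python
from collections import defaultdict


def make_trials(condition, accepted_when_correct, accepted_when_incorrect, n=4):
    """Build toy (condition, ai_correct, user_accepted) rows. Hypothetical helper."""
    rows = [(condition, True, i < accepted_when_correct) for i in range(n)]
    rows += [(condition, False, i < accepted_when_incorrect) for i in range(n)]
    return rows


# Made-up toy data for illustration, NOT results from the study.
TRIALS = (
    make_trials("none", 3, 2)     # no-explanation baseline
    + make_trials("trace", 4, 3)  # reasoning trace shown
    + make_trials("dual", 3, 1)   # arguments for and against shown
)


def trust_metrics(trials):
    """Per condition: acceptance rate on correct vs. incorrect AI answers."""
    # counts[condition][ai_correct] = [times accepted, total trials]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for condition, ai_correct, accepted in trials:
        cell = counts[condition][ai_correct]
        cell[0] += int(accepted)
        cell[1] += 1

    metrics = {}
    for condition, by_correctness in counts.items():
        accepted_ok, n_ok = by_correctness[True]
        accepted_bad, n_bad = by_correctness[False]
        if n_ok == 0 or n_bad == 0:
            continue  # both kinds of trial are needed to compare rates
        appropriate = accepted_ok / n_ok  # accepted when the AI was right
        false_trust = accepted_bad / n_bad  # accepted when the AI was wrong
        metrics[condition] = {
            "appropriate_trust": appropriate,
            "false_trust": false_trust,
            # A gap near zero means acceptance does not track correctness.
            "discrimination": appropriate - false_trust,
        }
    return metrics


metrics = trust_metrics(TRIALS)
for condition, m in metrics.items():
    print(condition, m)
```

In this toy data the trace condition raises acceptance on both correct and incorrect answers, so its discrimination is no better than the baseline, while the dual condition lowers false trust and widens the gap. That is the qualitative pattern the note describes; the numbers themselves carry no evidential weight.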
Why it matters: the standard explanation formats deployed in production are optimized to be one-sided advocates for the answer, which is exactly what makes them persuasive without being diagnostic. Surfacing the case against the answer is what restores the user's ability to discriminate. The design lesson is that "explainability" and "appropriate trust" can be at odds: adding a confident rationale can make a wrong answer more believable, so the intervention that helps is the one that deliberately argues against the system's own output.
— "Evaluating the False Trust Engendered by LLM Explanations", https://arxiv.org/abs/2605.10930
Related concepts in this collection
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  explains why traces persuade without informing: they look like reasoning but are not verified, and the user reads advocacy as evidence
- Do users worldwide trust confident AI outputs even when wrong?
  Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.
  one-sided explanations act like confidence signals, dominating users' accuracy tracking
- Are AI explanations really descriptions or adoption arguments?
  Most XAI work treats explanations as neutral descriptions of model behavior, but they may actually be doing persuasive work to justify AI adoption. What happens when we acknowledge this rhetorical function?
  names the advocacy framing of explanations that dual explanation is designed to counterbalance
- Can LLM explanations actually help humans predict model behavior?
  Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
  grounds the persuasive-not-informative finding mechanistically: explanations gain plausibility without gaining precision, so they raise acceptance without improving diagnosis
- Can we distinguish helpful explanations from manipulative ones?
  Rhetorical strategies used to justify appropriate AI adoption rely on the same persuasion mechanisms as dark patterns. Without observable intent, explanation and manipulation look identical, raising urgent questions about how to audit XAI systems responsibly.
  extends the harm: one-sided rationales that engender false trust are the benign end of the same machinery that becomes a dark pattern
Original note title
only contrastive dual explanations arguing both sides genuinely improve users ability to detect ai errors