Why do models trust their own generated answers?
Can language models reliably detect their own errors through self-evaluation? This note explores whether the same process that generates an answer can objectively assess its correctness.
Self-detection — the use of a model's own capabilities to evaluate the trustworthiness of its outputs — is a widely used approach to hallucination mitigation and output quality assessment. The "Think Twice Before Trusting" paper identifies a fundamental structural problem with it: LLMs have an inherent bias toward trusting their own generated answers.
Two paradigms of self-detection both fail in the same direction (minimal sketches of each follow the list):
- Confidence calibration: Sampling multiple answers and checking agreement. Fails when errors are consistent — the model generates the same wrong answer repeatedly with high self-agreement.
- Self-evaluation: Directly asking the model whether its answer is correct. Fails because the model is biased toward validating what it generated.
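A minimal sketch of both paradigms, assuming a generic `generate(prompt)` callable that wraps whatever model is being checked; the function names and prompts here are illustrative, not taken from the paper:

```python
from collections import Counter

def agreement_confidence(generate, question, n_samples=10):
    """Confidence calibration: sample several answers and treat the
    agreement rate of the most common one as a trust signal.
    Caveat from the text: when the error is systematic, the same wrong
    answer dominates the samples and agreement stays high anyway."""
    answers = [generate(question) for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / n_samples

def self_evaluation_verdict(generate, question, answer):
    """Self-evaluation: ask the model to judge its own answer.
    Caveat from the text: the generated answer is, by construction, a
    high-probability output for the evaluating model, which biases the
    verdict toward 'yes'."""
    verdict = generate(
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Is the proposed answer correct? Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```

The point of the sketch is that neither path consults anything outside the model's own distribution: the agreement score and the yes/no verdict are both computed by the same source that produced the answer.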
The mechanism is not random — it is structural. The same model, shaped by the same training, that produced the incorrect answer is also the one judging whether that answer is correct. Distributional bias toward self-agreement is baked in: responses the model generated are, by definition, high-probability outputs, and high-probability outputs feel more "correct" to the evaluating model. This is the dynamic of "Why do language models avoid correcting false user claims?" applied at the output-evaluation level: the model accommodates its own prior outputs rather than critically assessing them.
The proposed fix — evaluating trustworthiness by comparing the generated answer against a broader answer space — breaks the self-agreement loop. When the model must justify multiple candidate answers (not just its own), the strong justifications available for correct alternatives counterbalance the bias toward the generated answer.
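A sketch of the comparative idea, under assumptions: the alternative candidates are supplied from outside (e.g., multiple-choice options or independently sampled rivals), and the prompts are illustrative rather than the paper's actual pipeline:

```python
def broader_answer_space_verdict(generate, question, own_answer, alternatives):
    """Trust estimation over a broader answer space: the model must justify
    every candidate, not only its own answer, and the final judgment compares
    those justifications against each other."""
    candidates = [own_answer] + [a for a in alternatives if a != own_answer]
    justifications = {
        c: generate(
            f"Question: {question}\n"
            f"Candidate answer: {c}\n"
            "Give the strongest justification you can for this answer."
        )
        for c in candidates
    }
    listing = "\n".join(f"- {c}: {j}" for c, j in justifications.items())
    verdict = generate(
        f"Question: {question}\n"
        f"Candidates with justifications:\n{listing}\n"
        "Which candidate is best supported? Reply with the answer only."
    )
    # Trust the model's own answer only if it survives the comparison.
    return verdict.strip() == own_answer.strip()
```

The design choice that matters is that the model's own answer competes on equal footing: a strong justification for a correct alternative can now outweigh the prior pull toward the generated answer.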
This connects to "Does revising your own reasoning actually help or hurt?": both findings identify the same asymmetry — external perspective breaks the self-referential loop, internal perspective perpetuates it. The difference is that self-detection failure is specifically about the evaluation act, while revision-source failure is about the correction act.
For deployment: systems that use LLM self-evaluation as a reliability signal (e.g., uncertainty estimation, output filtering) are implicitly assuming models can detect their own errors. This assumption is false when errors are systematic. The signal is reliable only for idiosyncratic errors the model would not generate with high confidence — the cases where self-detection is needed least.
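As a deployment illustration, here is a hypothetical gate built on the agreement score from the first sketch (a pattern assumed for illustration, not a recommendation from the paper). It makes the limitation concrete: the threshold only filters answers the model itself is unsure about.

```python
def gated_answer(generate, question, threshold=0.8, n_samples=8):
    """Gate outputs on a self-derived trust signal (the agreement score
    from agreement_confidence above). Systematic errors pass the gate
    because the model repeats them confidently; the filter mainly
    catches idiosyncratic slips."""
    answer, agreement = agreement_confidence(generate, question, n_samples)
    if agreement >= threshold:
        return answer
    return None  # abstain, or escalate to an external verifier
```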
Source: Reasoning Methods · CoT · ToT
Related concepts in this collection
- Does revising your own reasoning actually help or hurt?
  Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.
  Same asymmetry, adjacent mechanism: external perspective helps, internal self-reference degrades.
- Why do language models avoid correcting false user claims?
  Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.
  Face-saving at the self-evaluation level: the model validates its own output as a form of face maintenance.
- Does a model improve by arguing with itself?
  When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenges from other models?
  Degeneration-of-Thought is the multi-turn version of self-trust failure; both document increasing confidence in wrong answers through self-reference.
- Can models abandon correct beliefs under conversational pressure?
  Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. This matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
  Both involve belief-formation errors in the presence of a single information source: external pressure vs. self-generated pressure.
Original note title: llm self-detection fails because models have inherent bias toward trusting their own generated answers