
Why do models trust their own generated answers?

Can language models reliably detect their own errors through self-evaluation? This note explores whether the same process that generates an answer can objectively assess its correctness.

Note · 2026-02-22 · sourced from Reasoning Methods CoT ToT

Self-detection — the use of a model's own capabilities to evaluate the trustworthiness of its outputs — is a widely used approach to hallucination mitigation and output quality assessment. The "Think Twice Before Trusting" paper identifies a fundamental structural problem with it: LLMs have an inherent bias toward trusting their own generated answers.

Both existing paradigms of self-detection fail in the same direction: whether the model is asked how confident it is in its generated answer or asked to judge whether that answer is correct, it only ever evaluates an answer it has already produced, and so it over-trusts it.
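
As a rough sketch, those two paradigms reduce to two prompt shapes. `complete` is a hypothetical stand-in for whatever LLM call a system uses, and the prompt wording is illustrative rather than taken from the paper; the point is that both paradigms only ever show the model its own answer.

```python
def complete(prompt: str) -> str:
    """Placeholder for a single LLM call (any provider or local model)."""
    raise NotImplementedError

def confidence_calibration(question: str, generated_answer: str) -> str:
    # Paradigm 1: ask the model how confident it is in its own answer.
    return complete(
        f"Question: {question}\n"
        f"Proposed answer: {generated_answer}\n"
        "How confident are you that this answer is correct? "
        "Reply with a number between 0 and 1."
    )

def self_evaluation(question: str, generated_answer: str) -> str:
    # Paradigm 2: ask the model to judge whether its own answer is correct.
    return complete(
        f"Question: {question}\n"
        f"Proposed answer: {generated_answer}\n"
        "Is this answer correct? Reply 'yes' or 'no' with a brief justification."
    )
```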

The mechanism is not random; it is structural. The same training process that produced the incorrect answer also evaluates whether that answer is correct. Distributional bias toward self-agreement is baked into the model: responses the model generated are, by definition, high-probability outputs, and high-probability outputs feel more "correct" to the evaluating model. This is the accommodation bias of "Why do language models avoid correcting false user claims?" applied at the output-evaluation level: the model accommodates its own prior outputs rather than critically assessing them.
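
A toy way to see the distributional side of this, assuming access to token log-probabilities (the helper below is hypothetical): the generated answer is by construction a high-probability sequence under the very distribution that is later asked to judge it.

```python
def sequence_logprob(question: str, answer: str) -> float:
    """Hypothetical helper: sum of token log-probs of `answer` given `question`,
    e.g. read from an API that returns per-token logprobs."""
    raise NotImplementedError

def self_agreement_gap(question: str, generated: str, alternative: str) -> float:
    # Greedy or low-temperature decoding picks high-probability continuations,
    # so this gap is typically positive: the evaluating model already "prefers"
    # its own output before any explicit judgement is asked for.
    return sequence_logprob(question, generated) - sequence_logprob(question, alternative)
```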

The proposed fix — evaluating trustworthiness by comparing the generated answer against a broader answer space — breaks the self-agreement loop. When the model must justify multiple candidate answers (not just its own), the strong justifications available for correct alternatives counterbalance the bias toward the generated answer.
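
A minimal sketch of that direction, simplified from the paper's actual procedure and reusing the hypothetical `complete` helper above; `sample_answers` and the comparison prompt are illustrative assumptions, not the paper's implementation.

```python
def sample_answers(question: str, n: int = 5) -> list[str]:
    """Hypothetical: sample n diverse candidate answers, e.g. at high temperature."""
    raise NotImplementedError

def justify(question: str, candidate: str) -> str:
    # Ask the model to argue for each candidate, not only its own answer.
    return complete(
        f"Question: {question}\n"
        f"Candidate answer: {candidate}\n"
        "Give the strongest justification you can for this candidate."
    )

def trust_score(question: str, generated: str) -> float:
    # Answer space = the generated answer plus sampled alternatives.
    candidates = [generated] + [a for a in sample_answers(question) if a != generated]
    justifications = {c: justify(question, c) for c in candidates}

    # Judge the generated answer against all justified candidates at once;
    # strong cases for correct alternatives counterbalance self-agreement.
    reply = complete(
        f"Question: {question}\n"
        + "\n".join(f"Candidate: {c}\nJustification: {j}\n" for c, j in justifications.items())
        + f"Given these justifications, how likely is '{generated}' to be correct? "
        "Reply with a number between 0 and 1."
    )
    return float(reply)
```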

This connects to "Does revising your own reasoning actually help or hurt?": both findings identify the same asymmetry, in which an external perspective breaks the self-referential loop while an internal perspective perpetuates it. The difference is that self-detection failure is specifically about the act of evaluation, while revision-source failure is about the act of correction.

For deployment: systems that use LLM self-evaluation as a reliability signal (e.g., uncertainty estimation, output filtering) are implicitly assuming models can detect their own errors. This assumption is false when errors are systematic. The signal is reliable only for idiosyncratic errors the model would not generate with high confidence — the cases where self-detection is needed least.
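
For example, a filtering gate of the kind described above might look like the sketch below, reusing the hypothetical helpers from the earlier sketches; the threshold is an arbitrary illustrative value, and the comment marks the failure mode this note is about.

```python
TRUST_THRESHOLD = 0.7  # illustrative value, not a recommendation

def filtered_answer(question: str) -> str | None:
    answer = complete(f"Question: {question}\nAnswer:")
    score = trust_score(question, answer)  # or any plain self-evaluation score

    # Caveat: this gate mostly catches idiosyncratic slips. Systematic errors
    # the model generates and endorses with high confidence pass through,
    # which is exactly when a reliability signal is needed most.
    return answer if score >= TRUST_THRESHOLD else None
```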


Source: Reasoning Methods CoT ToT

Original note title: llm self-detection fails because models have inherent bias toward trusting their own generated answers