Language Understanding and Pragmatics · Psychology and Social Cognition

Can a model be truthful without actually being honest?

Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?

Note · 2026-02-23 · sourced from MechInterp
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The RepE (representation engineering) paper makes a precise distinction: a truthful model avoids asserting false statements (output matches reality). An honest model asserts what it believes (output matches internal representations). These are different evaluation targets.

For a truthful model, asserting S requires that S be factually correct — regardless of whether the model "believes" S. For an honest model, asserting S requires that the model believe S — regardless of whether S is correct. A model can be honest but wrong (it believes something false), or truthful but dishonest (it states the truth while "believing" something different).
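
A tiny sketch makes the two properties concrete. The statements, beliefs, and world model below are toy placeholders for illustration, not anything probed from a real model:

```python
# Toy illustration of truthfulness vs honesty (hypothetical statements and beliefs).
WORLD = {
    "The Eiffel Tower is in Paris": True,
    "The Great Wall is visible from space": False,
}

def is_truthful(assertion: str) -> bool:
    # Truthfulness: the asserted statement matches reality.
    return WORLD[assertion]

def is_honest(assertion: str, believed: str) -> bool:
    # Honesty: the asserted statement matches what the model believes.
    return assertion == believed

# Honest but wrong: the model asserts the false statement it believes.
print(is_truthful("The Great Wall is visible from space"),           # False
      is_honest("The Great Wall is visible from space",
                believed="The Great Wall is visible from space"))     # True

# Truthful but dishonest: the model asserts a true statement it does not believe.
print(is_truthful("The Eiffel Tower is in Paris"),                    # True
      is_honest("The Eiffel Tower is in Paris",
                believed="The Great Wall is visible from space"))     # False
```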

Current truthfulness benchmarks (like TruthfulQA) only check factual correctness of outputs. They cannot distinguish between two types of failures:

  1. Capability failures — the model expresses its beliefs, which happen to be wrong
  2. Dishonesty — the model does not faithfully convey its internal representations
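
A toy scorer in the style of an output-only benchmark shows the conflation. The items and the belief field below are hypothetical, and the point is precisely that real benchmarks never see that field:

```python
# Output-only scoring cannot tell a wrong belief from a hidden correct one.
def benchmark_score(answer: str, gold: str) -> int:
    return int(answer == gold)

gold = "Paris"
capability_failure = {"belief": "Lyon", "answer": "Lyon"}    # honest but wrong
dishonesty         = {"belief": "Paris", "answer": "Lyon"}   # knows, but misreports

for case in (capability_failure, dishonesty):
    print(benchmark_score(case["answer"], gold))  # 0 in both cases: indistinguishable
    # Separating them requires access to the belief, e.g. via an internal probe:
    print(case["belief"] == gold and case["answer"] != gold)  # dishonesty flag
```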

This has a counterintuitive scaling implication: at a constant level of honesty, larger models should improve on truthfulness benchmarks simply through better capabilities. If observed truthfulness gains fall short of what the capability gains would predict, honesty may actually be declining. Larger models may be less honest despite being more truthful — a possibility current evaluation frameworks cannot detect.
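
A back-of-the-envelope model makes the arithmetic explicit. The numbers below are invented for illustration: assume each item is answered honestly with probability h (the model asserts its belief), beliefs are correct with probability a (capability), and dishonest answers happen to be correct with probability q:

```python
# Crude accounting of benchmark truthfulness as a function of honesty and capability.
def truthfulness(h: float, a: float, q: float) -> float:
    return h * a + (1 - h) * q

q = 0.2                                                   # assumed roughly constant across scale
small_model = truthfulness(h=0.9, a=0.6, q=q)             # ~0.56
large_if_honesty_held = truthfulness(h=0.9, a=0.9, q=q)   # ~0.83

observed_large = 0.70                                     # hypothetical benchmark score
implied_h = (observed_large - q) / (0.9 - q)              # invert t = h*a + (1-h)*q at a = 0.9
print(round(small_model, 2), round(large_if_honesty_held, 2), round(implied_h, 2))  # 0.56 0.83 0.71

# Truthfulness rose (0.56 -> 0.70) while implied honesty fell (0.9 -> ~0.71):
# the failure mode an output-only benchmark cannot see.
```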

The mechanistic evidence supports the distinction: RepE's lie detection identifies neural activity associated not just with false outputs but with the process of being dishonest — reasoning about deception, speculating about consequences of lying. The propensity for honesty or dishonesty exhibits distributional properties, and the final output may not fully reflect underlying thought processes.
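
The reading side of that evidence can be sketched in a few lines. RepE builds reading vectors from contrastive prompts; the version below uses a simple difference of means on synthetic activations, so it illustrates the idea rather than reproducing the paper's actual pipeline:

```python
# Sketch of a linear "honesty" reading direction (synthetic activations, not real ones).
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 64, 200   # hidden size and number of contrastive prompt pairs (assumed)

# Pretend last-token activations at one layer under "answer honestly" vs "lie" instructions.
honest_acts    = rng.normal(0.0, 1.0, size=(n_pairs, d)) + 0.5
dishonest_acts = rng.normal(0.0, 1.0, size=(n_pairs, d)) - 0.5

# Reading direction: normalized difference of condition means.
direction = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def honesty_score(activation: np.ndarray) -> float:
    # Project an activation onto the direction; higher means more "honest-like".
    return float(activation @ direction)

# Held-out activations from the two conditions separate along the direction.
print(honesty_score(rng.normal(0.5, 1.0, size=d)))    # positive
print(honesty_score(rng.normal(-0.5, 1.0, size=d)))   # negative
```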

This connects to Does calling LLM errors hallucinations point us toward the wrong fixes? — the fabrication framing treats all incorrect output as the same mechanism. The truthfulness/honesty distinction suggests two mechanistically different pathways to incorrect output, requiring different interventions.

The encoding-generation gap documented by Do language models actually use their encoded knowledge? provides a concrete mechanism for how truthfulness and honesty can diverge: knowledge can be encoded in internal representations (the model "knows" the truth) without that encoding causally influencing what the model generates (the model outputs something different). RepE's manipulation experiments specifically target this gap by finding causal directions — representations that, when modified, actually change the output. Similarly, as Do language models actually use their reasoning steps? argues, evaluating honesty through CoT traces is unreliable: the visible reasoning may not reflect the model's actual computational pathway to the answer.
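
The manipulation side, the causal test, amounts to adding a scaled direction to an intermediate activation and checking whether the output actually moves. The two-layer toy network and random direction below are stand-ins; RepE applies the same pattern to transformer hidden states with directions found by its reading procedure:

```python
# Schematic causal-direction test: steer an intermediate activation via a forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(1, 16)

direction = torch.randn(32)
direction = direction / direction.norm()
alpha = 4.0                                  # steering strength (assumed hyperparameter)

def steer(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return output + alpha * direction

baseline = model(x)
handle = model[1].register_forward_hook(steer)   # intervene on the hidden activation
steered = model(x)
handle.remove()

# If the direction is causally relevant to the output, the two differ.
print(baseline, steered)
```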


Source: MechInterp

