Can a model be truthful without actually being honest?
Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
The RepE paper makes a precise distinction: a truthful model avoids asserting false statements (output matches reality). An honest model asserts what it believes (output matches internal representations). These are different evaluation targets.
A truthful model that asserts S requires S to be factually correct — regardless of whether the model "believes" S. An honest model that asserts S requires the model to believe S — regardless of whether S is correct. A model can be honest but wrong (it believes something false), or truthful but dishonest (it states truth while "believing" something different).
Current truthfulness benchmarks (like TruthfulQA) only check factual correctness of outputs. They cannot distinguish between two types of failures:
- Capability failures — the model expresses its beliefs, which happen to be wrong
- Dishonesty — the model does not faithfully convey its internal representations
This has a counterintuitive scaling implication: under constant honesty levels, larger models should improve on truthfulness benchmarks simply through better capabilities. If observed improvement doesn't match the capability improvement rate, honesty may actually be declining. Larger models may be less honest despite being more truthful — a possibility current evaluation frameworks cannot detect.
The mechanistic evidence supports the distinction: RepE's lie detection identifies neural activity associated not just with false outputs but with the process of being dishonest — reasoning about deception, speculating about consequences of lying. The propensity for honesty or dishonesty exhibits distributional properties, and the final output may not fully reflect underlying thought processes.
This connects to Does calling LLM errors hallucinations point us toward the wrong fixes? — the fabrication framing treats all incorrect output as the same mechanism. The truthfulness/honesty distinction suggests two mechanistically different pathways to incorrect output, requiring different interventions.
The encoding-generation gap documented by Do language models actually use their encoded knowledge? provides a concrete mechanism for how truthfulness and honesty can diverge: knowledge can be encoded in internal representations (the model "knows" the truth) without that encoding causally influencing what the model generates (the model outputs something different). RepE's manipulation experiments specifically target this gap by finding causal directions -- representations that, when modified, actually change the output. Similarly, since Do language models actually use their reasoning steps?, evaluating honesty through CoT traces is unreliable: the visible reasoning may not reflect the model's actual computational pathway to the answer.
Source: MechInterp
Related concepts in this collection
-
Does calling LLM errors hallucinations point us toward the wrong fixes?
Explores whether the metaphor of 'hallucination' for LLM errors misdirects our efforts. The terminology we choose shapes which interventions we prioritize and how we conceptualize the underlying problem.
fabrication conflates capability failure and dishonesty into one category; RepE's distinction could refine the taxonomy
-
Can LLMs hold contradictory ethical beliefs and behaviors?
Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.
artificial hypocrisy is precisely the case where honesty (output matches belief) and truthfulness (output matches reality) diverge
-
Can language models describe their own learned behaviors?
Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
if models can describe their behaviors, they may be able to describe their honesty state
-
Do language models actually use their encoded knowledge?
Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
the encoding-generation gap is one mechanism that produces the truthfulness/honesty divergence: a model can encode truthful knowledge (probing confirms it) yet generate dishonest outputs because the encoding fails to causally influence generation
-
Do language models actually use their reasoning steps?
Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
unfaithful CoT undermines honesty evaluation: if reasoning steps do not causally drive the answer, a model's apparent honesty (reasoning that matches its output) may be post-hoc rationalization rather than genuine belief expression
-
Can language models detect their own internal anomalies?
Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
introspective awareness is a prerequisite for genuine honesty (accessing internal states to express them faithfully) but simultaneously enables strategic dishonesty (detecting and concealing state-output divergences)
Click a node to walk · click center to open · click Open full network for a force-directed map
Original note title
truthfulness and honesty are mechanistically distinct properties in LLMs — a model can be truthful without being honest and vice versa