Can model explanations help humans predict what models actually do?
Do explanations that sound plausible to humans actually help them forecast model behavior on new cases? Understanding this gap matters because RLHF optimizes for plausible explanations, not predictive ones.
"Do Models Explain Themselves?" introduces a rigorous evaluation framework for model explanations: can the explanation help a human predict what the model would do on related but different inputs? If a model answers "yes" to "Can eagles fly?" with the explanation "all birds can fly," then a human would infer it also answers "yes" to "Can penguins fly?" If the model actually says "no," the explanation was imprecise — it gave the human a wrong mental model.
Two metrics operationalize this (a minimal sketch of the precision calculation follows the list):
- Simulation precision: the fraction of counterfactuals where human inference (from the explanation) matches the model's actual output
- Simulation generality: the diversity of counterfactuals relevant to the explanation
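As a concrete illustration, here is a minimal sketch of how precision could be computed, assuming two hypothetical helpers that are not from the paper: `simulator_guess(explanation, counterfactual)`, which returns the answer a human (or LLM) simulator infers from the explanation alone, or `None` when the explanation says nothing about that input, and `model_answer(counterfactual)`, which returns the model's actual output.

```python
def simulation_precision(explanation, counterfactuals, simulator_guess, model_answer):
    """Fraction of simulatable counterfactuals where the simulator's
    inference from the explanation matches the model's actual output."""
    guesses = [(cf, simulator_guess(explanation, cf)) for cf in counterfactuals]
    # Keep only counterfactuals the simulator can answer from the explanation;
    # how many (and how varied) these are is what simulation generality tracks.
    simulatable = [(cf, guess) for cf, guess in guesses if guess is not None]
    if not simulatable:
        return 0.0  # the explanation licenses no inferences at all
    matches = sum(guess == model_answer(cf) for cf, guess in simulatable)
    return matches / len(simulatable)
```

In the eagle/penguin example above, "all birds can fly" licenses a "yes" guess for "Can penguins fly?"; if the model actually answers "no", that counterfactual counts against precision.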
The key finding: precision does not correlate with plausibility. Explanations that humans judge as factually correct and logically coherent do NOT enable accurate prediction of model behavior. This means RLHF — which optimizes for human approval of explanations — will improve plausibility (explanations that look good) without improving precision (explanations that predict behavior). The model learns to generate explanations humans like, not explanations humans can use.
The second finding reinforces this: GPT-4 can stand in for human simulators, because its agreement with human annotators is comparable to, and sometimes higher than, human-human agreement. This validates GPT-4 as a precision evaluator but also underscores that the precision problem is not a measurement issue — it is genuine.
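To make that validation step concrete, here is a hedged sketch (not the paper's code) that compares an LLM simulator against human simulators using simple percent agreement; `llm_guesses`, `human_a`, and `human_b` are hypothetical per-counterfactual label lists.

```python
from statistics import mean

def percent_agreement(xs, ys):
    """Share of items on which two annotators give the same label."""
    return mean(1.0 if x == y else 0.0 for x, y in zip(xs, ys))

def validate_llm_simulator(llm_guesses, human_a, human_b):
    """Compare LLM-human agreement against the human-human baseline."""
    human_human = percent_agreement(human_a, human_b)
    llm_human = mean([
        percent_agreement(llm_guesses, human_a),
        percent_agreement(llm_guesses, human_b),
    ])
    # The finding above: llm_human is comparable to, and sometimes above,
    # human_human, so the LLM can substitute for humans when scoring precision.
    return {"human_human": human_human, "llm_human": llm_human}
```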
The implication for the CoT-as-explanation paradigm is severe. The entire interpretability case for chain-of-thought rests on the assumption that reading the trace helps users understand how the model works. But if explanation precision is low, users build incorrect mental models from CoT. As the related note "Do chain of thought traces actually help humans understand reasoning?" argues, optimizing for better-looking traces (via RLHF) will make the mental model problem worse, not better — users will be more confident in less accurate predictions.
Source: Reasoning o1 o3 Search
Related concepts in this collection
- Does chain of thought reasoning actually explain model decisions?
  When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems. The weak correlation between CoT quality and output quality is the production-system version of low counterfactual simulatability.
- Do chain of thought traces actually help humans understand reasoning?
  When models show their work through chain of thought traces, do humans find them interpretable? Research tested whether the traces that improve model performance also improve human understanding. The decoupled objectives here (precision ≠ plausibility) are the metric-level evidence for that architectural claim.
- Do users worldwide trust confident AI outputs even when wrong?
  Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts. Users trust plausible explanations the same way they trust confident outputs; both fail at prediction.
- Can we detect memorable moments by observing emotional expressions?
  Emotion recognition systems assume that detecting emotional moments will identify what people remember. But does observed emotion in group settings actually predict individual memorability, or does the proxy fail? An analogous proxy failure: plausible-looking explanations don't predict actual understanding, just as emotional-looking moments don't predict actual memorability; both show that observable surface features diverge from the functional process they are assumed to index.
Original note title
counterfactual simulatability of llm explanations is low and uncorrelated with plausibility — rlhf cannot fix explanation precision