LLM Reasoning and Architecture · Reinforcement Learning for LLMs · Language Understanding and Pragmatics

Can model explanations help humans predict what models actually do?

Do explanations that sound plausible to humans actually help them forecast model behavior on new cases? Understanding this gap matters because RLHF optimizes for plausible explanations, not predictive ones.

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Do Models Explain Themselves?" introduces a rigorous evaluation framework for model explanations: can the explanation help a human predict what the model would do on related but different inputs? If a model answers "yes" to "Can eagles fly?" with the explanation "all birds can fly," then a human would infer it also answers "yes" to "Can penguins fly?" If the model actually says "no," the explanation was imprecise — it gave the human a wrong mental model.

Two metrics operationalize this:

Simulation precision: of the counterfactual inputs a simulator (human or LLM) feels able to predict after reading the explanation, the fraction where the prediction matches the model's actual output.

Simulation generality: how diverse the set of counterfactual inputs is that the explanation lets the simulator predict at all.
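A minimal sketch of how precision could be scored, using the eagle/penguin case above. The data structure, field names, and the choice to exclude "can't tell" judgments from precision are illustrative assumptions, not the paper's reference implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CounterfactualJudgment:
    """One counterfactual input, the simulator's guess, and the model's actual answer."""
    counterfactual: str           # e.g. "Can penguins fly?"
    human_guess: Optional[str]    # predicted model answer; None means "can't tell"
    model_answer: str             # what the model actually answered

def simulation_precision(judgments: list[CounterfactualJudgment]) -> float:
    """Fraction of counterfactuals the simulator felt able to predict
    where the guess matched the model's actual answer."""
    simulatable = [j for j in judgments if j.human_guess is not None]
    if not simulatable:
        return 0.0
    hits = sum(j.human_guess == j.model_answer for j in simulatable)
    return hits / len(simulatable)

# The explanation "all birds can fly" leads the human to guess "yes"
# for penguins, but the model actually says "no", so precision drops.
example = [CounterfactualJudgment("Can penguins fly?", human_guess="yes", model_answer="no")]
print(simulation_precision(example))  # 0.0
```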

The key finding: precision does not correlate with plausibility. Explanations that humans judge as factually correct and logically coherent do NOT enable accurate prediction of model behavior. This means RLHF — which optimizes for human approval of explanations — will improve plausibility (explanations that look good) without improving precision (explanations that predict behavior). The model learns to generate explanations humans like, not explanations humans can use.

The second finding reinforces this: GPT-4 can stand in for the human simulator, with agreement against human judgments comparable to (and sometimes higher than) human-human inter-annotator agreement. This validates GPT-4 as a precision evaluator, and it also underscores that low precision is not a measurement artifact; the problem is genuine.
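A hedged sketch of what an LLM-as-simulator step might look like. The `complete` callable, prompt wording, and answer normalization are placeholders for whatever completion client you use; the paper's actual prompts and setup differ.

```python
from typing import Callable, Optional

def simulate_with_llm(complete: Callable[[str], str], original_q: str, original_a: str,
                      explanation: str, counterfactual_q: str) -> Optional[str]:
    """Ask a simulator LLM to predict the explained model's answer on a
    counterfactual input, given only the original Q/A and its explanation."""
    prompt = (
        f"A model was asked: {original_q}\n"
        f"It answered: {original_a}\n"
        f"Its explanation: {explanation}\n\n"
        f"Based only on this, what will the same model answer to: {counterfactual_q}\n"
        "Reply with 'yes', 'no', or 'cannot tell'."
    )
    reply = complete(prompt).strip().lower()
    return None if "cannot tell" in reply else reply
```

Running such a simulator over many generated counterfactuals and feeding its guesses into a precision score like the one sketched earlier approximates the GPT-4-as-evaluator setup the note describes.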

The implication for the CoT-as-explanation paradigm is severe. The entire interpretability case for chain-of-thought rests on the assumption that reading the trace helps users understand how the model works. But if explanation precision is low, users build incorrect mental models from CoT. And since it is already doubtful whether chain-of-thought traces actually help humans understand the model's reasoning, optimizing for better-looking traces via RLHF makes the mental-model problem worse, not better: users become more confident in less accurate predictions.


Source: Reasoning o1 o3 Search

Original note title

counterfactual simulatability of llm explanations is low and uncorrelated with plausibility — rlhf cannot fix explanation precision