Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

Paper · arXiv 2307.08678 · Published July 17, 2023

Large language models (LLMs) are trained to imitate humans to explain human decisions. However, do LLMs explain themselves? Can they help humans build mental models of how LLMs process different inputs? To answer these questions, we propose to evaluate counterfactual simulatability of natural language explanations: whether an explanation can enable humans to precisely infer the model’s outputs on diverse counterfactuals of the explained input. For example, if a model answers “yes” to the input question “Can eagles fly?” with the explanation “all birds can fly”, then humans would infer from the explanation that it would also answer “yes” to the counterfactual input “Can penguins fly?”. If the explanation is precise, then the model’s answer should match humans’ expectations.

Such an explanation is problematic when humans form a wrong mental model of the LLM (i.e., incorrectly infer how it answers relevant counterfactuals) based on the explanation. Building a correct mental model of an AI system is important: it helps humans understand what the system can and cannot achieve (Chandrasekaran et al., 2018), which informs how to improve the system or deploy it appropriately without misuse or overtrust (Cassidy, 2009; Bansal et al., 2019; Ye and Durrett, 2022).

A good mental model should generalize to diverse unseen inputs and precisely infer the model’s outputs, so we propose two metrics accordingly for explanations (Figure 2). The first, simulation generality, measures the generality of an explanation by tracking the diversity of the counterfactuals relevant to the explanation (e.g., “Humans do not consume meat” has more diverse relevant counterfactuals compared to “Muslims do not consume pork” and is thus more general). The second, simulation precision, tracks the fraction of counterfactuals where humans’ inference matches the model’s output.
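As a concrete illustration, the sketch below shows how the two metrics could be computed once the counterfactuals relevant to an explanation have been collected. The helper names (model_answer, human_inference, embed) are hypothetical placeholders rather than the paper's code, and the embedding-distance proxy for generality is only one plausible way to quantify counterfactual diversity.

```python
# Minimal sketch of the two metrics; model_answer, human_inference, and embed
# are hypothetical placeholders, not functions from the paper's code.
from itertools import combinations
from scipy.spatial.distance import cosine  # cosine *distance* = 1 - similarity

def simulation_precision(counterfactuals, model_answer, human_inference):
    """Fraction of counterfactuals where the simulator's guess, made from the
    explanation alone, matches the model's actual output."""
    guesses = [(human_inference(cf), model_answer(cf)) for cf in counterfactuals]
    judged = [(g, o) for g, o in guesses if g is not None]  # drop "cannot tell" cases
    if not judged:
        return 0.0
    return sum(g == o for g, o in judged) / len(judged)

def simulation_generality(counterfactuals, embed):
    """One plausible proxy for generality: average pairwise embedding distance
    among the counterfactuals an explanation is relevant to (more diverse
    counterfactuals -> a more general explanation)."""
    pairs = list(combinations(counterfactuals, 2))
    if not pairs:
        return 0.0
    return sum(cosine(embed(a), embed(b)) for a, b in pairs) / len(pairs)
```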

We also study how counterfactual simulatability relates to plausibility, which evaluates humans’ preference of an explanation based on its factual correctness and logical coherence. We found that precision does not correlate with plausibility, and hence naively optimizing human approvals (e.g., RLHF) might not fix the issue of low precision. To summarize, our paper

• proposes to evaluate counterfactual simulatability: whether an explanation can help humans build mental models;

• implements two metrics based on counterfactual simulatability: precision and generality;

• reveals that explanations generated by state-of-the-art LLMs are not precise and that current approaches might be insufficient.

Evaluation Metrics for Explanations. We summarize three popular existing metrics for explanations: plausibility, faithfulness, and simulatability. Plausibility evaluates humans’ preference of an explanation based on its factual correctness and logical coherence (Herman, 2017; Lage et al., 2019; Jacovi and Goldberg, 2020). It is different from faithfulness, which measures whether an explanation is consistent with the model’s own decision process (Harrington et al., 1985; Ribeiro et al., 2016; Gilpin et al., 2018; Wu and Mooney, 2019; Lakkaraju et al., 2019; Jacovi and Goldberg, 2020). In prior work, faithfulness is usually evaluated by whether it is possible to train a black-box model to predict the model’s outputs based on its explanations (Li et al., 2020; Kumar and Talukdar, 2020; Lyu et al., 2022). Simulatability measures how well humans can predict the model’s outputs based on its explanations.

Humans and models have different commonsense knowledge. When a human uses commonsense knowledge to generalize a mental model, the result may differ from the model’s generalization if the two have different commonsense knowledge. For example, if a model “thinks” that pigs are not omnivores (contrary to humans’ knowledge), then it may answer “no” to “Can pigs use chopsticks?” while being perfectly consistent with its explanation “Omnivores can use chopsticks.” Should humans use their own knowledge or the model’s knowledge when they generalize their mental models and judge entailment?

Solution. We argue that humans should use human knowledge when judging entailment and generalizing mental models, because probing the model’s knowledge for each counterfactual is time-consuming and difficult.

Our evaluation procedure of counterfactual simulatability has discriminative power. We check whether our evaluation procedure of simulation precision is powerful enough to discern differences among explanation systems that we know differ in quality. We construct a baseline system FORCED where we force the model to generate a Post-Hoc explanation conditioned on the answer it does not select (i.e., the answer it assigns a lower score to). We evaluate on the subset of examples where the model answers correctly under the NORMAL Post-Hoc setting, so that under the FORCED setting the model is forced to explain the wrong answer even though it knows the correct one. We evaluate simulation precision for both NORMAL and FORCED on StrategyQA. NORMAL outperforms FORCED significantly, by 45.2 precision points (p-value < 10⁻¹⁶), verifying that our evaluation procedure of simulation precision can discriminate worse explanation systems.

GPT-4 can approximate human simulators. We evaluate whether LLMs (GPT-3 and GPT-4) are good proxies for human simulators by comparing their inter-annotator agreement (IAA) with humans (averaged across multiple humans) against the average IAA between humans. We report IAA between GPT-3, GPT-4, and humans (measured by Cohen’s kappa) in Table 3. GPT-4 approximates human simulators much better than GPT-3, and GPT-4 agrees with humans about as well as humans agree with each other. In fact, the IAA between GPT-4 and humans is higher than the IAA between humans on SHP, suggesting that GPT-4 annotations are less noisy than human annotations. We therefore use GPT-4 as the simulator for experiments on SHP and stick to human simulators for experiments on StrategyQA.
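The agreement check behind this comparison is a standard pairwise Cohen’s kappa over the simulators’ labels. Below is a minimal sketch, assuming each simulator (human or LLM) has labeled the same list of counterfactuals; the annotator names and label set are illustrative, not from the paper.

```python
# Pairwise inter-annotator agreement (Cohen's kappa) between simulators.
# Annotator names and the label set below are illustrative.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(annotations):
    """annotations: dict mapping annotator -> list of labels
    (e.g. "yes"/"no"/"cannot tell") over the same counterfactuals."""
    return {
        (a, b): cohen_kappa_score(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }

# Example: pairwise_kappa({"human_1": h1, "human_2": h2, "gpt4": g4}).
# GPT-4 is kept as a proxy simulator only if its kappa with humans is
# comparable to the human-human kappa.
```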

LLM prompting generates more diverse simulatable counterfactuals than a baseline that ignores explanations. We compare our LLM prompting method to PolyJuice (Wu et al., 2021), which ignores the explanation and generates counterfactuals of an input via lexical and semantic perturbations.
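A minimal sketch of explanation-guided counterfactual generation by prompting is shown below; the prompt wording is illustrative rather than the paper’s exact prompt, and llm stands in for any text- or chat-completion call that returns a string.

```python
# Illustrative prompt for generating counterfactuals that an explanation
# speaks to; the exact prompt in the paper may differ. `llm` is a placeholder
# for any completion call that returns a string.
PROMPT = """A model was asked: "{question}"
It answered "{answer}" and explained: "{explanation}"

Write {n} new yes/no questions whose answers are also implied by this
explanation. Make them as diverse as possible, one per line."""

def generate_counterfactuals(llm, question, answer, explanation, n=10):
    prompt = PROMPT.format(question=question, answer=answer,
                           explanation=explanation, n=n)
    # Strip leading bullets or numbering from each returned line.
    return [line.lstrip("-0123456789. ").strip()
            for line in llm(prompt).splitlines() if line.strip()]
```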

Build mental models via interactions. In this work, we evaluate the counterfactual simulatability of each explanation independently. In the real world, however, humans often interact with an AI system over multiple rounds, asking clarification and follow-up questions to build a better mental model of the system (Zylberajch et al., 2021; Wu, 2022). Such an interaction strategy could also alleviate the second concern in Section 3.3, since it helps humans better understand what the AI system “knows”. Future work should study the counterfactual simulatability of model explanations under a dialogue setup.