Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness?

Paper · arXiv 2004.03685 · Published April 7, 2020

With the growing popularity of deep-learning-based NLP models comes a need for interpretable systems. But what is interpretability, and what constitutes a high-quality interpretation? In this opinion piece we reflect on the current state of interpretability evaluation research. We call for more clearly differentiating between the different criteria an interpretation should satisfy, and focus on the faithfulness criterion. We survey the literature with respect to faithfulness evaluation, and arrange the current approaches around three assumptions, providing an explicit form to how faithfulness is “defined” by the community. We provide concrete guidelines on how evaluation of interpretation methods should and should not be conducted. Finally, we claim that the current binary definition of faithfulness sets a potentially unrealistic bar for being considered faithful. We call for discarding the binary notion of faithfulness in favor of a more graded one, which we believe will be of greater practical utility.

There is considerable research effort in attempting to define and categorize the desiderata of a learned system’s interpretation, most of which revolves around specific use-cases (Lipton, 2018; Guidotti et al., 2018, inter alia).

Two particularly notable criteria, each useful for a different purpose, are plausibility and faithfulness. “Plausibility” refers to how convincing the interpretation is to humans, while “faithfulness” refers to how accurately it reflects the true reasoning process of the model (Herman, 2017; Wiegreffe and Pinter, 2019).

Despite the difference between the two criteria, many authors do not clearly make the distinction, and sometimes conflate the two. Moreover, the majority of works do not explicitly name the criteria under consideration, even when they clearly belong to one camp or the other.

We argue that this conflation is dangerous. For example, consider the case of recidivism prediction, where a judge is exposed to a model’s prediction and its interpretation, and the judge believes the interpretation to reflect the model’s reasoning process. Since the interpretation’s faithfulness carries legal consequences, a plausible but unfaithful interpretation may be the worst-case scenario. The lack of explicit claims in research papers may misinform potential users of the technology, who are not versed in its inner workings. Therefore, a clear distinction between these terms is critical.

A distinction is often made between two methods of achieving interpretability: (1) interpreting existing models via post-hoc techniques; and (2) designing inherently interpretable models. Rudin (2018) argues in favor of inherently interpretable models, which by design claim to provide more faithful interpretations than post-hoc interpretation of black-box models.

We warn against taking this argumentation at face value: a method being “inherently interpretable” is merely a claim that needs to be verified before it can be trusted. Indeed, while attention mechanisms have been considered “inherently interpretable” (Ghaeini et al., 2018; Lee et al., 2017), recent work has cast doubt on their faithfulness (Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019).