Investigating Gender Bias in Language Models Using Causal Mediation Analysis

The success of neural network models in various natural language processing tasks, coupled with their opaque nature, has led to much interest in interpreting and analyzing such models. One goal of these analyses is to identify whether a model utilizes latent information in its internal representations to arrive at a prediction. This is of particular importance when diagnosing the reasons for a biased prediction. A popular class of analysis methods, often called structural analysis, aims to extract this information using probing classifiers that predict linguistic properties from representations of trained models (e.g., Adi et al., 2017; Conneau et al., 2018; Hupkes et al., 2018; Tenney et al., 2019). However, since probing classifiers only yield a correlational measure between a model’s representations and an external property (Belinkov and Glass, 2019), they cannot show if the property is causally connected to the model’s predictions. Moreover, Barrett et al. (2019) showed that probing classifiers may generate unfaithful interpretations and fail to generalize to unseen data.
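To make the correlational nature of probing concrete, here is a minimal sketch of a probing classifier on synthetic data (the representations, the encoded property, and the least-squares probe are all hypothetical illustrations, not the setup of any cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "hidden representations": 200 examples, 16 dimensions.
# Dimension 3 is made to linearly encode a hypothetical binary property.
labels = rng.integers(0, 2, size=200)
reps = rng.normal(size=(200, 16))
reps[:, 3] += 3.0 * labels  # make the property linearly decodable

# Linear probe: least-squares fit on a train split, evaluated held-out.
train, test = slice(0, 150), slice(150, 200)
w, *_ = np.linalg.lstsq(reps[train], labels[train].astype(float), rcond=None)
preds = (reps[test] @ w) > 0.5
accuracy = (preds == labels[test]).mean()
# High probe accuracy shows the property is *decodable* from the
# representations -- it does not show the model *uses* it to predict.
```

The final comment is the crux of the limitation above: a probe can succeed even on information the downstream model never consults.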

We introduce a methodology for interpreting neural models that addresses these limitations. We adapt causal mediation analysis (Pearl, 2001) to analyze the mechanisms by which information flows from input to output through different model components. Mediation analysis relies on measuring the change in an output following a counterfactual intervention on an intermediate variable, or mediator. Through such interventions, one can measure the degree to which inputs influence outputs directly (direct effect) or indirectly through the mediator (indirect effect). In our case, the mediator can be any model component that we wish to study, such as a neuron or an attention head. We propose multiple controlled interventions on these mediators, which reveal their causal role in a model's behavior.
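The direct/indirect decomposition can be sketched with a toy linear model (the functional forms here are hypothetical stand-ins; the actual analysis measures changes in a model's output probabilities, not these formulas):

```python
def mediator(x):
    # Hypothetical mediator, e.g., one neuron's activation as a function of input.
    return 2.0 * x

def output(x, m):
    # Hypothetical output depending on the input directly and via the mediator.
    return 0.5 * x + 1.5 * m

def effects(x, x_cf):
    """Compare output under the original input x and counterfactual x_cf."""
    m, m_cf = mediator(x), mediator(x_cf)
    base = output(x, m)
    total = output(x_cf, m_cf) - base
    direct = output(x_cf, m) - base    # change the input, hold mediator fixed
    indirect = output(x, m_cf) - base  # hold the input, set mediator to its
                                       # counterfactual value
    return total, direct, indirect

total, direct, indirect = effects(x=0.0, x_cf=1.0)
# In this linear toy model, direct + indirect == total.
```

The `indirect` line is the key move: only the mediator is switched to its counterfactual value, so any resulting output change is attributable to information flowing through that component. (In a nonlinear model the direct and indirect effects need not sum to the total effect.)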

In an experiment using several datasets designed to gauge a model’s gender bias, we find that gender bias effects increase with larger models, which potentially absorb more bias from the underlying training data. The causal mediation analysis further reveals that gender bias is sparse, with much of the effect concentrated in a relatively small proportion of neurons and attention heads.