Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference
Recent studies on reasoning in language models (LMs) have sparked a debate on whether they can learn systematic inferential principles or merely exploit superficial patterns in the training data. To understand and uncover the mechanisms adopted for formal reasoning in LMs, this paper presents a mechanistic interpretation of syllogistic inference. Specifically, we present a methodology for circuit discovery aimed at interpreting content-independent and formal reasoning mechanisms. Through two distinct intervention methods, we uncover a sufficient and necessary circuit involving middle-term suppression that elucidates how LMs transfer information to derive valid conclusions from premises. Furthermore, we investigate how belief biases manifest in syllogistic inference, finding evidence of partial contamination from additional attention heads responsible for encoding commonsense and contextualized knowledge. Finally, we explore the generalization of the discovered mechanisms across various syllogistic schemes, model sizes, and architectures. The identified circuit is sufficient and necessary for the syllogistic schemes on which the models achieve high accuracy (≥ 60%), with compatible activation patterns across models of different families. Overall, our findings suggest that LMs learn transferable content-independent reasoning mechanisms, but that, at the same time, such mechanisms do not involve generalizable and abstract logical primitives, remaining susceptible to contamination by the same world knowledge acquired during pre-training.
Through mechanistic interpretability techniques such as Activation Patching (Meng et al., 2022) and embedding space analysis (i.e., the Logit Lens) (Nostalgebraist, 2020; Geva et al., 2022; Dar et al., 2023), we investigate the following main research questions. RQ1: How is the content-independent syllogistic reasoning mechanism internalized in LMs during pre-training? RQ2: Are content-independent mechanisms disentangled from specific world knowledge and belief biases? RQ3: Does the core reasoning mechanism generalize across syllogistic schemes, different model sizes, and architectures?
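To make the first technique concrete, the snippet below gives a minimal activation-patching sketch using TransformerLens. The prompt wording, the single-letter terms, and the patched location (layer 8, head 9) are illustrative assumptions rather than the exact experimental setup of this work.

```python
from functools import partial

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# Matched prompts: only the middle term of the second premise differs.
clean_prompt = "All A are B. All B are C. Therefore, all A are"
corrupt_prompt = "All A are B. All D are C. Therefore, all A are"

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_head_output(z, hook, head):
    # Overwrite one attention head's output with its clean-run activation.
    z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
    return z

# Patch a single (illustrative) head into the corrupted run and check how much
# of the clean prediction " C" it restores.
layer, head = 8, 9
patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("z", layer), partial(patch_head_output, head=head))],
)
answer_token = model.to_single_token(" C")
print(patched_logits[0, -1, answer_token].item())
```

Repeating this over all layer/head pairs and comparing the restored logit of the valid conclusion against the clean and corrupted baselines is the standard way such patching experiments localize the components that carry the relevant information.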
To answer these questions, we present a methodology that consists of three main stages. First, we define a syllogistic completion task designed to assess the model's ability to predict valid conclusions from premises and to facilitate the construction of test sets for circuit analysis. Second, we implement a circuit discovery pipeline on the syllogistic schema instantiated only with symbolic variables (Table 1, Symbolic) to identify the core sub-components responsible for content-independent reasoning. We conduct this analysis under two intervention methods, middle-term corruption and all-term corruption (sketched below), aiming to identify latent transitive reasoning mechanisms and term-related information flow.
Third, we investigate the generalization of the identified circuit on concrete schemes instantiated with commonsense knowledge to identify potential belief biases and to explore how the internal behavior varies across different schemes and model sizes.
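As an illustration of the two intervention settings used in the second stage, the following sketch builds matched clean/corrupted symbolic prompts. The prompt template, symbol pool, and corruption rules are assumptions made for exposition, not the verbatim test-set construction used in our experiments.

```python
import random

SYMBOLS = list("ABCDEFGH")

def build_pair(mode: str, seed: int = 0):
    """Return (clean prompt, corrupted prompt, expected completion) for one schema."""
    rng = random.Random(seed)
    a, m, c = rng.sample(SYMBOLS, 3)  # subject, middle, and predicate terms
    clean = f"All {a} are {m}. All {m} are {c}. Therefore, all {a} are"
    answer = f" {c}"

    spare = [s for s in SYMBOLS if s not in (a, m, c)]
    if mode == "middle":
        # Middle-term corruption: break only the transitive link between the premises.
        m2 = rng.choice(spare)
        corrupt = f"All {a} are {m}. All {m2} are {c}. Therefore, all {a} are"
    elif mode == "all":
        # All-term corruption: replace every term, removing all term-related overlap.
        a2, m2, c2 = rng.sample(spare, 3)
        corrupt = f"All {a2} are {m2}. All {m2} are {c2}. Therefore, all {a2} are"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return clean, corrupt, answer

print(build_pair("middle"))
print(build_pair("all"))
```

Intuitively, the middle-term setting isolates the transitive link between the premises, while the all-term setting removes all term-level overlap with the clean prompt; contrasting the two helps separate latent transitive reasoning from generic term-related information flow.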
We present the following overall conclusions. The circuit analysis reveals that LMs develop specific inference mechanisms during pre-training, providing evidence for a three-stage mechanism for syllogistic reasoning: (1) naive recitation of the first premise; (2) suppression of duplicated middle-term information; and (3) mediation towards the correct output through the interplay of mover attention heads (see Figure 1 and the Logit Lens sketch after these conclusions).
Further experiments on circuit transferability demonstrate that the identified mechanism is still necessary for reasoning on syllogistic schemes instantiated with commonsense knowledge. However, a deeper analysis suggests that specific belief biases acquired during pre-training might contaminate the content-independent circuit mechanism with additional attention heads responsible for encoding contextualized world knowledge.
We find that the identified circuit is sufficient and necessary for all the unconditionally valid syllogistic schemes on which the model achieves high downstream accuracy (≥ 60%) (see Appendix A for the list of schemes). This result suggests that LMs learn reasoning mechanisms that are transferable across different schemes.
The intervention results on models with different architectures and sizes (i.e., GPT-2 (Radford et al., 2019), Pythia (Biderman et al., 2023), Llama (Dubey et al., 2024), and Qwen (Yang et al., 2024)) show similar suppression mechanism patterns and information flow. However, we found evidence that the contribution of attention heads becomes more complex with increasing model size, further supporting the hypothesis of increasing contamination from external world knowledge.
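To illustrate the suppression stage of the three-stage mechanism summarized above, the hedged Logit Lens sketch below projects the residual stream at the final position through the unembedding after each layer and tracks the logits of the duplicated middle term and of the valid conclusion term. If the suppression account holds, the middle-term logit should dominate in earlier layers (naive recitation of the first premise) and then fall behind the conclusion term in later layers; the prompt, model, and token choices here are illustrative and not the exact analysis reported in the paper.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "All A are B. All B are C. Therefore, all A are"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

middle = model.to_single_token(" B")   # duplicated middle term
answer = model.to_single_token(" C")   # valid conclusion term

# Project the residual stream at the final position through the final LayerNorm
# and the unembedding after each layer, and track the two terms' logits.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1, :]                      # [batch, d_model]
    logits = model.unembed(model.ln_final(resid.unsqueeze(1)))[:, 0, :]
    print(f"layer {layer:2d}  "
          f"middle={logits[0, middle].item():.2f}  "
          f"answer={logits[0, answer].item():.2f}")
```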
Our study is conducted using TransformerLens (Nanda and Bloom, 2022) on an Nvidia A100 GPU with 80GB of memory. The dataset and code to reproduce our experiments are available online to encourage future work in the field.