Do Large Language Models Reason Causally Like Us? Even Better?

Paper · arXiv 2502.10215 · Published February 14, 2025
Tags: Reasoning Methods · CoT · ToT · Reasoning Critiques

Indeed, a growing number of researchers have proposed that current LLMs are unable to generalize causal ideas beyond their training distribution and/or without strong user-induced guidance (e.g., chain-of-thought prompting; Jin et al., 2023; Kıcıman et al., 2023). Thus, understanding the extent to which LLMs reason causally, and whether they show biases similar to people's when they deviate from normative principles, has practical importance for deploying AI systems. To this end, Jin et al. (2023) introduced the CLADDER dataset, comprising 10,000 causal reasoning questions designed to evaluate the formal causal reasoning abilities of LLMs. Although their dataset includes colliders, they did not contrast LLMs with humans.

Materials. The collider causal structure C1 → E ← C2 was embedded in one of three cover stories from three different knowledge domains (meteorology, economics, and sociology), allowing for a natural language description of the causal structure. For example, the sociology cover story opened with a domain introduction and a causal mechanism:

• Domain introduction: Sociologists seek to describe and predict the regular patterns of societal interactions. To do this, they study some important variables or attributes of societies. They also study how these attributes are responsible for producing or causing one another.
• Causal mechanism: Assume you live in a world that works like this:

Procedure. A key contribution of this work is the creation of a causal inference task dataset. Humans responded using an interactive slider that defaulted to 50; this default could have introduced a motor bias that encouraged responses near the middle of the scale.

Results. Because the two causes of a collider network are marginally independent, participants should normatively judge that p(C1 = 1 | C2 = 1) = p(C1 = 1 | C2 = 0); they judged instead that p(C1 = 1 | C2 = 1) > p(C1 = 1 | C2 = 0). This is an instance of the well-known Markov violations that characterize how humans reason with numerous causal network topologies involving generative relations (Davis & Rehder, 2020). Markov violations have been characterized as an associative bias (or what Rehder & Waldmann, 2017, referred to as a rich-get-richer bias), in which the presence of one causal variable makes another, supposedly independent, variable seem more likely.

Diagnostic inferences involve inferring the state of one of the causes given information about the effect and possibly the other cause. An important property of collider networks with independent causal relations is explaining away, the phenomenon whereby, when the effect is present, observing one cause should decrease the likelihood of the other: p(C1 = 1 | E = 1, C2 = 1) < p(C1 = 1 | E = 1) < p(C1 = 1 | E = 1, C2 = 0). Explaining away is one of the many ways that causal and associative knowledge differ, as it entails that the presence/absence of one variable lowers/raises the probability of another. Figure 2d reveals that humans indeed exhibited the explaining-away pattern. However, the effect is quite weak, and theoretical analyses have revealed that explaining away is often weaker than is normatively warranted (Davis & Rehder, 2020; Rehder, 2024). Note that when the effect E is absent (Figure 2e), explaining away is absent entirely. Both weak explaining away and Markov violations with collider graphs have been documented in multiple studies (see Davis & Rehder, 2020, for a review).
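To make these normative claims concrete, the following minimal sketch (not from the paper) enumerates the joint distribution of a collider C1 → E ← C2 under an assumed noisy-OR parameterization with illustrative parameter values. It checks both properties quoted above: the causes are marginally independent, and conditioning on E = 1 produces the explaining-away ordering.

```python
from itertools import product

# Illustrative parameters (my assumption, not the paper's): priors on the
# causes, causal strengths m1 and m2, and a background/leak probability b.
pc = 0.5           # prior p(C1 = 1) = p(C2 = 1)
m1, m2 = 0.8, 0.8  # causal strength of C1 and C2
b = 0.1            # probability E occurs from background causes alone

def p_e_given(c1, c2):
    # Noisy-OR: E is absent only if the background and every present cause
    # all independently fail to produce it.
    return 1 - (1 - b) * (1 - m1) ** c1 * (1 - m2) ** c2

def joint(c1, c2, e):
    # p(C1, C2, E) with marginally independent causes.
    p_causes = (pc if c1 else 1 - pc) * (pc if c2 else 1 - pc)
    pe = p_e_given(c1, c2)
    return p_causes * (pe if e else 1 - pe)

def cond(query, given):
    # p(query | given) by exact enumeration over the 8 worlds (C1, C2, E).
    worlds = [dict(zip(("c1", "c2", "e"), w)) for w in product((0, 1), repeat=3)]
    def match(w, d):
        return all(w[k] == v for k, v in d.items())
    num = sum(joint(**w) for w in worlds if match(w, {**given, **query}))
    den = sum(joint(**w) for w in worlds if match(w, given))
    return num / den

# Markov condition: normatively p(C1=1 | C2=1) = p(C1=1 | C2=0).
print(cond({"c1": 1}, {"c2": 1}))  # 0.5
print(cond({"c1": 1}, {"c2": 0}))  # 0.5  (identical: the causes are independent)

# Explaining away: p(C1=1 | E=1, C2=1) < p(C1=1 | E=1) < p(C1=1 | E=1, C2=0).
print(cond({"c1": 1}, {"e": 1, "c2": 1}))  # ~0.54
print(cond({"c1": 1}, {"e": 1}))           # ~0.66
print(cond({"c1": 1}, {"e": 1, "c2": 0}))  # ~0.89
```

Noisy-OR is a common choice for independent generative causes in this literature; other parameter values change the exact numbers, but the equality and the explaining-away ordering persist as long as both causal strengths are positive and the causes are independent.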