Language models show human-like content effects on reasoning tasks

Paper · arXiv 2207.07051 · Published July 14, 2022

Abstract reasoning is a key ability for an intelligent system. Large language models (LMs) achieve above-chance performance on abstract reasoning tasks, but exhibit many imperfections. However, human abstract reasoning is also imperfect. For example, human reasoning is affected by our real-world knowledge and beliefs, and shows notable “content effects”: humans reason more reliably when the semantic content of a problem supports the correct logical inferences. These content-entangled reasoning patterns play a central role in debates about the fundamental nature of human intelligence. Here, we investigate whether language models — whose prior expectations capture some aspects of human knowledge — similarly mix content into their answers to logical problems. We explored this question across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task (1). We evaluate state-of-the-art large language models, as well as humans, and find that the language models reflect many of the same patterns observed in humans across these tasks — like humans, models answer more accurately when the semantic content of a task supports the logical inferences. These parallels are reflected both in answer patterns and in lower-level features like the relationship between model answer distributions and human response times. Our findings have implications for understanding both these cognitive effects in humans and the factors that contribute to language model performance.

A hallmark of abstract reasoning is the ability to systematically perform algebraic operations over variables that can be bound to any entity (2, 3): the statement ‘X is bigger than Y’ logically implies that ‘Y is smaller than X’, no matter the values of X and Y. That is, abstract reasoning is ideally content-independent (2). The capacity for reliable and consistent abstract reasoning is frequently highlighted as a crucial missing component of current AI (4, 5, 6). For example, while large language models (LMs) exhibit some impressive emergent behaviors, including some performance on abstract reasoning tasks (7, 8, 9, 10, 11; though cf. 12), they have been criticized for failing to achieve systematic consistency in their abstract reasoning (e.g. 13, 14, 15, 16).

However, humans — arguably the best-known instances of general intelligence — are far from perfectly rational abstract reasoners (17, 18, 19). Patterns of biases in human reasoning have been studied across a wide range of tasks and domains (18). Here, we focus in particular on ‘content effects’ — the finding that humans are affected by the semantic content of a logical reasoning problem. Specifically, humans reason more readily and more accurately about familiar, believable, or grounded situations than about unfamiliar, unbelievable, or abstract ones.

For example, when presented with a syllogism like the following:

All students read.

Some people who read also write essays.

Therefore some students write essays.

humans will often classify it as a valid argument. However, when presented with:

All students read.

Some people who read are professors.

Therefore some students are professors.

humans are much less likely to say it is valid (20, 21, 22) — despite the fact that the arguments above are logically equivalent (both are invalid). Similarly, humans struggle to reason about how to falsify conditional rules involving abstract attributes (1, 23), but reason more readily about logically equivalent rules grounded in realistic situations (24, 25, 26). This human tendency also extends to other forms of reasoning, e.g. probabilistic reasoning, where humans are notably worse when problems do not reflect intuitive expectations (27).

The literature on human cognitive biases is extensive, but many of these biases can be idiosyncratic and context-dependent. For example, even some of the seminal findings in the influential work of Kahneman et al. (18), like ‘base rate neglect’, are sensitive to context and experimental design (28, 29), with several studies demonstrating exactly the opposite effect in a different context (30). However, the content effects on which we focus have been a notably consistent finding, documented in humans across different reasoning tasks and domains: deductive and inductive, or logical and probabilistic (31, 1, 32, 33, 20, 27). This ubiquity makes these effects harder to explain as mere idiosyncrasies. Such pervasive sensitivity to content directly contradicts the definition of abstract reasoning as content-independent, and it speaks to longstanding debates over the fundamental nature of human intelligence: are we best described as algebraic symbol-processing systems (2, 34), or as emergent connectionist ones (35, 36) whose inferences are grounded in learned semantics? Yet explanations or models of content effects in the psychological sciences often focus on a single (task- and content-specific) phenomenon and invoke bespoke mechanisms that apply only to those specific settings (e.g. 25). Could content effects be explained more generally? Could they emerge from simple learning processes over naturalistic data?

In this work, we address these questions by examining whether language models show this human-like blending of logic with semantic content. Language models possess prior knowledge — expectations over the likelihood of particular sequences of tokens — that are shaped by their training. Indeed, the goal of the “pre-train and adapt” or the “foundation models” (37) paradigm is to endow a model with broadly accurate prior knowledge that enables learning a new task rapidly. Thus, language model representations often reflect human semantic cognition; e.g., language models reproduce patterns like association and typicality effects (38, 39), and language model predictions can reproduce human knowledge and beliefs (40, 41, 42, 43). In this work, we explore whether this prior knowledge affects a language model’s performance on logical reasoning tasks. While a variety of recent works have explored biases and imperfections in language models’ performance (e.g. 13, 14, 15, 44, 16), we focus on the specific question of whether content interacts with logic in these systems as it does in humans. This question has significant implications not only for characterizing LMs, but potentially also for understanding human cognition, by contributing new ways of understanding the balance, interactions, and trade-offs between the abstract and grounded capabilities of a system.

We explore how the content of logical reasoning problems affects the performance of a range of large language models (45, 46, 47). To avoid potential dataset contamination, we create entirely new datasets using designs analogous to those used in prior cognitive work, and we also collect directly comparable human data with our new stimuli. We find that language models reproduce human content effects across three different logical reasoning tasks (Fig. 1). We first explore a simple Natural Language Inference (NLI) task, and show that models and humans answer fairly reliably, with relatively modest influences of content. We then examine the more challenging task of judging whether a syllogism is a valid argument, and show that models and humans are biased by the believability of the conclusion. We finally consider realistic and abstract/arbitrary versions of the Wason selection task (1) — a task introduced over 50 years ago that demonstrates a failure of systematic human reasoning — and show that models and humans perform better with a realistic framing. Our findings with human participants replicate and extend existing findings in the cognitive literature. We also report novel analyses of item-level effects, and of the effects of content and items on continuous measures of model and human responses. We close with a discussion of the implications of these findings for understanding human cognition as well as language models.

Evaluating content effects on logical tasks

In this work, we evaluate content effects on three logical reasoning tasks, which are depicted in Fig. 1. These three tasks involve different types of logical inferences, and different kinds of semantic content. However, these distinct tasks admit a consistent definition of content effects: the extent to which reasoning is facilitated in situations in which the semantic content supports the correct logical inference, and correspondingly the extent to which reasoning is harmed when semantic content conflicts with the correct logical inference (or, in the Wason tasks, when the content is simply arbitrary). We also evaluate versions of each task where the semantic content is replaced with nonsense non-words, which lack semantic content and thus should neither support nor conflict with reasoning performance. (However, note that in some cases, particularly the Wason tasks, changing to nonsense content requires more substantially altering the kinds of inferences required in the task; see Methods.)
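To make this shared definition concrete: a content effect can be summarized as the gap in accuracy between trials where the semantic content is consistent with the correct logical answer and trials where it conflicts with it, with the nonsense trials serving as a neutral baseline. The minimal Python sketch below illustrates this summary; the trial records, condition labels, and field names are hypothetical rather than the paper's actual data format.

# Minimal sketch: summarizing a content effect as an accuracy gap across conditions.
# The trial records and field names are hypothetical, not the paper's data format.
from collections import defaultdict

def accuracy_by_condition(trials):
    # trials: dicts with 'condition' in {'consistent', 'violate', 'nonsense'},
    # plus 'answer' (the model's or participant's response) and 'correct_answer'.
    correct, total = defaultdict(int), defaultdict(int)
    for t in trials:
        total[t["condition"]] += 1
        correct[t["condition"]] += int(t["answer"] == t["correct_answer"])
    return {c: correct[c] / total[c] for c in total}

def content_effect(trials):
    # Accuracy when content supports the inference minus accuracy when it conflicts.
    acc = accuracy_by_condition(trials)
    return acc["consistent"] - acc["violate"]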

Natural Language Inference

The first task we consider has been studied extensively in the natural language processing literature (48). In the classic Natural Language Inference (NLI) problem, a model receives two sentences, a ‘premise’ and a ‘hypothesis’, and has to classify whether the hypothesis is ‘entailed by’, ‘contradicts’, or ‘is neutral with respect to’ the premise. Traditional datasets for this task were crowd-sourced (49), leading to sentence pairs that do not strictly follow logical definitions of entailment and contradiction. To make this a more strictly logical task, we follow Dasgupta et al. (50) and generate comparisons (e.g. ‘X is smaller than Y’). We then give participants an incomplete inference such as “If puddles are bigger than seas, then...” and give them a forced choice between two possible hypotheses to complete it: “seas are bigger than puddles” and “seas are smaller than puddles.” Note that one of these completions is consistent with real-world semantic beliefs (i.e. ‘believable’), while the other is logically consistent with the premise but contradicts real-world beliefs. We can then evaluate whether models and humans answer more accurately when the logically correct hypothesis is believable; that is, whether the content affects their logical reasoning.
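As a rough illustration of how such forced-choice comparison items can be assembled, the following Python sketch constructs one item. It reflects our own assumptions rather than the authors' stimulus-generation code, and the phrasing and object pairs are invented for illustration.

# Sketch of constructing a forced-choice NLI item from a size comparison.
# The object pairs and phrasing are illustrative; the paper's stimuli may differ.
def make_nli_item(larger, smaller, premise_matches_beliefs):
    # 'larger'/'smaller' name the real-world relation (e.g. seas are bigger than puddles).
    # When the premise contradicts that relation, the logically correct completion
    # is the unbelievable one, so belief and logic pull in opposite directions.
    if premise_matches_beliefs:
        premise = f"If {larger} are bigger than {smaller}, then..."
        logical = f"{smaller} are smaller than {larger}"  # follows logically, also believable
        foil = f"{smaller} are bigger than {larger}"
    else:
        premise = f"If {smaller} are bigger than {larger}, then..."
        logical = f"{larger} are smaller than {smaller}"  # follows logically, but unbelievable
        foil = f"{larger} are bigger than {smaller}"
    return {"premise": premise, "choices": [logical, foil], "correct_answer": logical}

item = make_nli_item("seas", "puddles", premise_matches_beliefs=False)
# item["premise"] == "If puddles are bigger than seas, then..."
# item["correct_answer"] == "seas are smaller than puddles" (logical but belief-violating)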

However, content effects are generally more pronounced in difficult tasks that require extensive logical reasoning (33, 21). We therefore consider two more challenging tasks where human content effects have been observed in prior work.

Syllogisms

Syllogisms (51) are a simple argument form in which two true statements necessarily imply a third. For example, the statements “All humans are mortal” and “Socrates is a human” together imply that “Socrates is mortal”. But human syllogistic reasoning is not purely abstract and logical; instead it is affected by our prior beliefs about the contents of the argument (20, 22, 52). For example, Evans et al. (20) showed that if participants were asked to judge whether a syllogism was logically valid or invalid, they were biased by whether the conclusion was consistent with their beliefs. Participants were very likely (90% of the time) to mistakenly say an invalid syllogism was valid if the conclusion was believable, and thus mostly relied on belief rather than abstract reasoning. Participants would also sometimes say that a valid syllogism was invalid if the conclusion was not believable, but this effect was somewhat weaker (but cf. 53). These “belief-bias” effects have been replicated and extended in various subsequent studies (22, 53, 54, 55). We similarly evaluate whether models and humans are more likely to endorse an argument as valid if its conclusion is believable, or to dismiss it as invalid if its conclusion is unbelievable.
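To illustrate how such a belief-bias trial might be posed to a language model, the sketch below lays out one trial. The prompt wording is our own assumption and query_model is a hypothetical placeholder, not the paper's actual prompt or evaluation code.

# Sketch of a syllogism validity-judgment trial crossing validity with believability.
# The prompt wording and query_model() are placeholders, not the paper's prompts.
def format_syllogism_prompt(premise1, premise2, conclusion):
    return (
        "Argument:\n"
        f"{premise1}\n{premise2}\nTherefore {conclusion}\n"
        "Is this argument logically valid or invalid?"
    )

trial = {
    "premise1": "All students read.",
    "premise2": "Some people who read also write essays.",
    "conclusion": "some students write essays.",
    "valid": False,        # the argument is logically invalid
    "believable": True,    # but the conclusion matches real-world beliefs
}

prompt = format_syllogism_prompt(trial["premise1"], trial["premise2"], trial["conclusion"])
# answer = query_model(prompt)  # placeholder for a language model call
# Belief bias: 'valid' responses should be more frequent when 'believable' is True,
# holding the argument's actual validity fixed.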

The Wason Selection Task

The Wason Selection Task (1) is a logic problem that can be challenging even for subjects with substantial education in mathematics or philosophy. Participants are shown four cards, and told a rule such as: “if a card has a ‘D’ on one side, then it has a ‘3’ on the other side.” The four cards respectively show ‘D’, ‘F’, ‘3’, and ‘7’. The participants are then asked which cards they need to flip over to check if the rule is true or false. The correct answer is to flip over the cards showing ‘D’ and ‘7’. However, Wason (1) showed that while most participants correctly chose ‘D’, they were much more likely to choose ‘3’ than ‘7’. That is, the participants should check the contrapositive of the rule (“not 3 implies not D”, which is logically implied), but instead they confuse it with the converse (“3 implies D”, which is not logically implied). This is a classic task in which reasoning according to the rules of formal logic does not come naturally for humans, and thus there is potential for prior beliefs and knowledge to affect reasoning.

Indeed, the difficulty of the Wason task depends upon the content of the problem. Past work has found that if an identical logical structure is instantiated in a common situation, particularly a social rule, participants are much more accurate (56, 24, 25, 26). For example, if participants are told the cards represent people, and the rule is “if they are drinking alcohol, then they must be 21 or older” and the cards show ‘beer’, ‘soda’, ‘25’, ‘16’, then many more participants correctly choose to check the cards showing ‘beer’ and ‘16’. We therefore similarly evaluate whether language models and humans are facilitated in reasoning about realistic rules, compared to more-abstract arbitrary ones. (Note that in our implementations of the Wason task, we forced participants and language models to choose exactly two cards, in order to most closely match answer formats between the humans and language models.)
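The structure of these trials, and the forced choice of exactly two cards, can be sketched as follows. The card labels mirror the examples above, while the all-or-nothing scoring convention is our assumption rather than a detail stated in this section.

# Sketch of a two-card forced choice on arbitrary vs. realistic Wason trials.
# Card labels mirror the examples above; the all-or-nothing scoring is an assumption.
ARBITRARY_TRIAL = {
    "rule": "if a card has a 'D' on one side, then it has a '3' on the other side",
    "cards": ["D", "F", "3", "7"],
    "correct": {"D", "7"},   # the antecedent card and the card that could falsify the rule
}
REALISTIC_TRIAL = {
    "rule": "if they are drinking alcohol, then they must be 21 or older",
    "cards": ["beer", "soda", "25", "16"],
    "correct": {"beer", "16"},
}

def score_two_card_choice(trial, chosen):
    # chosen: exactly two selected cards; credit only if both match the logical answer.
    assert len(chosen) == 2
    return int(set(chosen) == trial["correct"])

print(score_two_card_choice(ARBITRARY_TRIAL, ["D", "3"]))     # 0: the classic converse error
print(score_two_card_choice(REALISTIC_TRIAL, ["beer", "16"])) # 1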

The extent of content effects on the Wason task is also affected by background knowledge; education in mathematics appears to be associated with improved reasoning in abstract Wason tasks (57, 58). However, even those experienced participants were far from perfect — undergraduate mathematics majors and academic mathematicians achieved less than 50% accuracy on the arbitrary Wason task (57). This illustrates the challenge of abstract logical reasoning, even for experienced humans. As we will see in the next section, many human participants did struggle with the abstract versions of our tasks.