Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

Paper · arXiv 2410.21333 · Published October 27, 2024

Chain-of-thought (CoT) prompting (Wei et al., 2022; Nye et al., 2021) is a widely used technique for large language and multimodal models (LLMs and LMMs), instructing models to “think step-by-step” or providing other structure to incorporate into their responses. Large meta-studies have shown that this technique improves model performance on many tasks, particularly those involving symbolic reasoning (Sprague et al., 2024). More generally, inference-time reasoning has become a default component of the newest LLMs and LMMs, such as OpenAI o1-preview (OpenAI, 2024a), and is applied to queries made through Claude’s web interface and mobile apps (Anthropic, 2024). However, there are also cases where CoT decreases performance, and no clear patterns have been identified for when this happens. With the increasing use of inference-time reasoning in deployed models, it is imperative to understand and predict when CoT has a negative effect on model performance.
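To make the contrast concrete, the sketch below sets a direct (zero-shot) prompt against a CoT prompt for the same question. The `query_model` helper, the question, and the exact prompt wording are hypothetical placeholders for illustration, not the prompts used in this paper.

```python
# Minimal sketch of zero-shot vs. chain-of-thought prompting.
# `query_model` is a hypothetical stand-in for any chat-completion API call.

def query_model(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("Replace with a call to your model provider's API.")

question = "Which object is heaviest? (A) a feather (B) a brick (C) a car"

# Zero-shot: ask for the answer directly, with no intermediate reasoning.
zero_shot_prompt = f"{question}\nAnswer with a single option letter."

# Chain-of-thought: instruct the model to reason step by step before answering.
cot_prompt = (
    f"{question}\n"
    "Think step-by-step, writing out your reasoning, "
    "and then give your final answer as a single option letter."
)

# zero_shot_answer = query_model(zero_shot_prompt)
# cot_answer = query_model(cot_prompt)
```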

An important caveat is that models and humans have fundamentally different capabilities and consequently have different constraints affecting their performance (Griffiths, 2020; Shiffrin & Mitchell, 2023; McCoy et al., 2024). For example, LLMs have long context lengths that far exceed human memory limitations. Despite this, we believe that for some failure cases the tasks themselves, in conjunction with traits shared between humans and models, are responsible for thinking having a negative effect on performance. Thus, we anticipate that CoT will hurt model performance in cases where (i) deliberation hurts performance in humans, and (ii) the constraints governing human performance on the task generalize to LLMs.

To explore this approach, we draw on the cognitive psychology literature on cases where engaging in verbal thinking hurts human task performance (Schooler & Engstler-Schooler, 1990; Dijksterhuis, 2004; Van den Bos & Poletiek, 2008, inter alia).

A basic prerequisite for observing a negative impact from CoT is that zero-shot prompting yields reasonable performance. Thus, a logical reasoning task on which human judgments worsen after deliberation was not a candidate for a negative effect of CoT, because zero-shot prompting left models unable to score above chance.
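A minimal screening step consistent with this prerequisite might look like the sketch below, which checks whether zero-shot accuracy on a multiple-choice task is reliably above chance before the task is treated as a candidate for a CoT comparison. The one-sided binomial test and the significance threshold are illustrative assumptions, not the paper's exact selection procedure.

```python
from scipy.stats import binomtest

def above_chance(num_correct: int, num_items: int, num_choices: int,
                 alpha: float = 0.05) -> bool:
    """Return True if zero-shot accuracy is significantly above chance.

    Chance level is 1 / num_choices for a multiple-choice task; the one-sided
    binomial test and alpha threshold here are illustrative choices.
    """
    chance = 1.0 / num_choices
    result = binomtest(num_correct, num_items, chance, alternative="greater")
    return result.pvalue < alpha

# Example: 120 of 400 four-choice items answered correctly under zero-shot prompting.
if above_chance(num_correct=120, num_items=400, num_choices=4):
    print("Zero-shot performance exceeds chance; candidate task for a CoT comparison.")
else:
    print("Zero-shot performance is at chance; a CoT decrement cannot be assessed.")
```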

For example, recent studies that leverage insights or datasets from psychology have evaluated the representational capacity of LLMs (Frank, 2023), explored how RLHF and CoT lead to different outcomes when trying to make models both helpful and honest (Liu et al., 2024a), compared human and machine representations via similarity judgments (Peterson et al., 2018; Marjieh et al., 2023a;b; 2024a), determined that LLMs over-estimate human rationality in decisions (Liu et al., 2024b), identified incoherence in LLM probability judgments (Zhu & Griffiths, 2024), identified susceptibility to linguistic illusions in LLMs (Marjieh et al., 2024b), uncovered LLMs’ underlying social biases (Bai et al., 2024), used storytelling to understand episodic memory in LLMs (Cornell et al., 2023), constructed prompts using psychological theories of metaphor (Prystawski et al., 2022), discovered cross-linguistic variability in LLM representations (Niedermann et al., 2023), and probed the roles of language and vision for in-context learning in VLMs (Chen et al., 2024). Many of these studies start with a well-studied phenomenon in human cognition and then explore whether there is an isomorphic analog to it in LLMs or LMMs. Our work follows this established line of literature by connecting the well-studied impact of deliberation on human performance to the effects that occur in models using CoT.