Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). However, debates persist about whether LLMs exhibit abstract generalization or rely on shallow heuristics when given CoT prompts. To understand the factors influencing CoT reasoning, we provide a detailed case study of the symbolic reasoning task of decoding shift ciphers (Andress, 2014), in which each letter is shifted forward some number of positions in the alphabet. We analyze the pattern of results produced by three LLMs (GPT-4, Claude 3, and Llama 3.1) performing this task with CoT prompting. By focusing on a single, relatively simple task, we are able to identify three factors that systematically affect CoT performance: the probability of the task’s expected output (probability), what the model has implicitly learned during pre-training (memorization), and the number of intermediate operations involved in reasoning (noisy reasoning). We show that these factors can drastically influence task accuracy across all three LLMs; e.g., when tested with GPT-4, varying the output’s probability of occurrence shifts accuracy from 26% to 70%. Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning.
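To make the decoding task concrete, the following is a minimal Python sketch of a shift cipher; the function names and example word are illustrative, not taken from the paper.

```python
# Minimal sketch of the shift cipher task: encoding shifts each letter
# forward by `shift` positions in the alphabet (wrapping around), and
# decoding shifts it back. Names here are illustrative, not from the paper.
import string

ALPHABET = string.ascii_lowercase

def shift_encode(text: str, shift: int) -> str:
    """Shift every letter forward by `shift` positions, wrapping at 'z'."""
    return "".join(
        ALPHABET[(ALPHABET.index(c) + shift) % 26] if c in ALPHABET else c
        for c in text.lower()
    )

def shift_decode(text: str, shift: int) -> str:
    """Invert the encoding by shifting each letter back by `shift` positions."""
    return shift_encode(text, -shift)

# Example: with a shift of 3, "chain" encodes to "fkdlq"; decoding recovers it.
assert shift_encode("chain", 3) == "fkdlq"
assert shift_decode("fkdlq", 3) == "chain"
```

Decoding therefore amounts to shifting each letter backward by the same number of positions used to encode it, which makes the task simple to specify yet parameterizable in difficulty.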
We present an extensive case study of a single task that allows us to disentangle reasoning from memorization.
First, the accuracy of CoT is affected by the probability of the correct output: more probable outputs yield a stronger benefit from CoT. Second, performance is higher when memorization is possible, as indicated by higher accuracy on shift cipher variants that are encountered more frequently during pre-training. The effects of probability and memorization show that CoT performance is not fully systematic, abstract reasoning. Nonetheless, CoT performance is not driven solely by superficial heuristics: it also shows some hallmarks of true reasoning, albeit a noisy version in which the error rate increases with task difficulty (which we quantify as the number of implicit reasoning steps involved; see the sketch below). In addition, we find evidence that the effect of CoT fundamentally depends on generating sequences of words that, when conditioned upon, increase the probability of the correct answer; as long as this holds, CoT can succeed even when the demonstrations in the prompt are invalid. In the ongoing debate about whether LLMs reason or memorize (Feldman, 2020; Zhang et al., 2023a; Magar and Schwartz, 2022; Srivastava et al., 2024; Antoniades et al., 2024), our results thus support a reasonable middle ground: LLM behavior displays aspects of both memorization and reasoning, and it also reflects the probabilistic origins of these models.
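As one concrete way to operationalize the number of implicit reasoning steps, the hypothetical sketch below (not the paper’s measurement code) counts each single-letter backward move through the alphabet as one operation, so the operation count grows linearly with the shift distance and with message length.

```python
# Hypothetical illustration of quantifying task difficulty by the number of
# implicit reasoning steps: decoding one letter under a shift of n requires
# stepping backward through the alphabet n times, so larger shifts entail
# more intermediate operations (and, under noisy reasoning, more chances
# for an error to occur).
import string

ALPHABET = string.ascii_lowercase

def decode_with_steps(text: str, shift: int) -> tuple[str, int]:
    """Decode a shift cipher one backward step at a time, counting each
    single-letter move as one implicit reasoning operation."""
    steps = 0
    decoded = []
    for c in text:
        if c not in ALPHABET:
            decoded.append(c)
            continue
        idx = ALPHABET.index(c)
        for _ in range(shift):   # one backward move per counted step
            idx = (idx - 1) % 26
            steps += 1
        decoded.append(ALPHABET[idx])
    return "".join(decoded), steps

# A 5-letter word under shift 3 involves 15 single-step operations,
# versus 65 under shift 13: a proxy for why error rates rise with shift size.
word, n_ops = decode_with_steps("fkdlq", 3)
assert (word, n_ops) == ("chain", 15)
```

Under this operationalization, a noisy reasoner that errs with some small probability per operation will show accuracy that degrades as the shift distance grows, which is the qualitative pattern the noisy-reasoning factor describes.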