What three separate factors drive chain-of-thought performance?
Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.
The "Deciphering Factors Influencing CoT" paper achieves something rare: a clean decomposition of what drives Chain-of-Thought performance into three independently measurable factors, using the simple but controlled task of shift cipher decoding across GPT-4, Claude 3, and Llama 3.1.
Factor 1: Output probability. The probability of the correct output under the model's distribution dramatically affects CoT accuracy. Varying only how probable the correct output is shifts GPT-4 accuracy from 26% to 70%. CoT works better when the answer is already probable: it amplifies existing tendencies rather than overcoming them.
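A sketch of how this factor can be isolated, under my own illustrative setup (not the paper's exact stimuli): hold the shift fixed and vary only how likely the correct decoding is as English text, so any accuracy gap between the two conditions is attributable to output probability alone.

```python
# Isolating Factor 1: same cipher, same shift, only the probability of the
# correct output differs. The probe phrases below are illustrative examples.
import string

def rot(text: str, k: int) -> str:
    lo = string.ascii_lowercase
    return text.translate(str.maketrans(lo, lo[k % 26:] + lo[:k % 26]))

answers = {
    "high_prob": "to be or not to be",  # a likely English sequence
    "low_prob":  "be to not or be to",  # same words, unlikely order
}
# Encode both with the same shift; a model decoding these faces identical
# cipher difficulty, so any accuracy gap reflects output probability alone.
probes = {name: rot(answer, 7) for name, answer in answers.items()}
```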
Factor 2: Memorization. Performance is higher when the specific cipher variant was encountered more often during pre-training: rot13, by far the most common shift in web text, shows a telltale accuracy spike. This is not reasoning; it is pattern matching against memorized instances. The frequency of different shift values in training data directly predicts accuracy on those shifts.
Factor 3: Noisy reasoning. After controlling for probability and memorization, genuine reasoning effects remain, but they are noisy. Error rate increases with the number of implicit reasoning steps (the shift magnitude). This is real multi-step reasoning, but each step carries some error probability, so accuracy degrades with chain length.
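A toy model of this factor, assuming (my simplification, not the paper's fitted model) that each implicit one-letter hop succeeds independently with probability p_step: accuracy then decays geometrically with shift magnitude.

```python
# Toy noisy-reasoning model: each implicit step (one alphabet hop) succeeds
# independently with probability p_step, so accuracy decays with shift size.
# p_step = 0.95 is an illustrative value, not a fitted parameter.
def predicted_accuracy(shift: int, p_step: float = 0.95) -> float:
    return p_step ** shift

for k in (1, 5, 13, 25):
    print(f"shift {k:2d}: {predicted_accuracy(k):.2f}")
# shift  1: 0.95 / shift  5: 0.77 / shift 13: 0.51 / shift 25: 0.28
```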
The decomposition resolves the ongoing debate about whether LLMs reason or memorize: they do both, simultaneously, and the contribution of each factor varies by task. This supports "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" while adding a crucial nuance: the imitation is partially genuine, but contaminated by probability bias and memorization artifacts.
The probability factor is particularly important for understanding CoT faithfulness. Building on "Do language models actually use their reasoning steps?", the probability dependence reveals a specific mechanism for causal insufficiency: CoT "reasoning" succeeds partly because the generated token sequence raises the conditional probability of the correct answer, not because the logical content is being processed. This is exactly the mechanism behind "Does logical validity actually drive chain-of-thought gains?": invalid exemplars work because they still generate token sequences that shift output probability toward correct answers.
The noisy-reasoning factor connects to "Does more thinking time always improve reasoning accuracy?": if each reasoning step adds noise, then past some threshold the accumulated noise exceeds the signal, producing the inverted-U performance curve, as the sketch below illustrates.
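A minimal simulation of that prediction, assuming a toy model in which signal saturates once the problem's required depth is covered while per-step noise keeps compounding; `required` and `p_step` are illustrative parameters of my own choosing.

```python
# Toy inverted-U model: signal saturates once the problem's required depth
# is covered, while per-step noise compounds without bound.
def accuracy(n_steps: int, required: int = 10, p_step: float = 0.97) -> float:
    coverage = min(n_steps / required, 1.0)  # fraction of needed work done
    survival = p_step ** n_steps             # chance no step has gone wrong
    return coverage * survival

best = max(range(1, 41), key=accuracy)
assert best == 10  # accuracy peaks at the required depth, then declines
```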
Source: Reasoning Logic Internal Rules
Related concepts in this collection
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly. Nuance: imitation is partially genuine, but noisy and probability-contaminated.
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations. Nuance: probability dependence is a specific mechanism of causal insufficiency.
- Does logical validity actually drive chain-of-thought gains?
  What if invalid reasoning in CoT exemplars still improves performance? Tests whether logical correctness or structural format is the real driver of CoT's effectiveness. Nuance: the probability mechanism explains why invalid exemplars work.
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or whether there is a point beyond which additional reasoning becomes counterproductive. Nuance: the noisy-reasoning factor predicts the inverted-U, where accumulated noise exceeds signal.
- Do reasoning traces need to be semantically correct?
  Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning. Nuance: the probability factor explains why corrupted traces still work; intermediate tokens shift output probability toward correct answers regardless of their semantic content, and the genuine-reasoning factor is only one of three contributors.
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning. Nuance: aligns with the three-factor decomposition; structural coherence provides the scaffolding for the probability and noisy-reasoning factors to operate, while content correctness maps only to the memorization factor.
Original note title
CoT performance reflects three disentangled factors: output probability, memorization, and noisy reasoning