LLM Reasoning and Architecture

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.

Note · 2026-02-22 · sourced from Reasoning Logic Internal Rules
What makes chain-of-thought reasoning actually work? How should researchers navigate LLM reasoning research? Do reasoning traces show how models actually think?

The "Deciphering Factors Influencing CoT" paper achieves something rare: a clean decomposition of what drives Chain-of-Thought performance into three independently measurable factors, using the simple but controlled task of shift cipher decoding across GPT-4, Claude 3, and Llama 3.1.
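The task itself is easy to pin down. A shift cipher maps each letter forward by a fixed offset, and decoding maps it back; a minimal sketch (the task format, not the paper's prompt):

```python
import string

def shift_decode(ciphertext: str, shift: int) -> str:
    """Decode a shift cipher: map each letter back by `shift` positions."""
    alpha = string.ascii_lowercase
    table = {c: alpha[(i - shift) % 26] for i, c in enumerate(alpha)}
    return "".join(table.get(c, c) for c in ciphertext.lower())

# shift 13 (rot13) is its own inverse and is by far the most common
# variant in web text
print(shift_decode("uryyb jbeyq", 13))  # -> "hello world"
```

Because the mapping is fully mechanical, any accuracy differences across shift values must come from the model, not the task — which is what makes the decomposition clean.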

Factor 1: Output probability. The probability of the correct output in the model's distribution dramatically affects CoT accuracy. Varying only the output's probability of occurrence shifts GPT-4 accuracy from 26% to 70%. CoT works better when the answer is already more probable — it amplifies existing tendencies rather than overcoming them.
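The amplification effect can be caricatured with a toy scoring model (my simplification, not the paper's method): if the final answer behaves like an argmax over prior output probability times reasoning support, then ambiguous decoding evidence still resolves toward the more probable string.

```python
def pick_answer(prior: dict, reasoning_support: dict) -> str:
    """Toy model: answer = argmax of output prior x reasoning evidence."""
    scores = {a: prior[a] * reasoning_support.get(a, 0.0) for a in prior}
    return max(scores, key=scores.get)

prior = {"stay": 0.6, "stya": 0.001}   # "stay" is far more probable as text
support = {"stay": 0.5, "stya": 0.5}   # decoding evidence is ambiguous
print(pick_answer(prior, support))     # the high-probability word wins
```

This is consistent with the paper's observation that errors tend to land on high-probability words rather than on the literal decode.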

Factor 2: Memorization. Performance is higher when the specific cipher variant was more frequently encountered during pre-training. This is not reasoning — it is pattern matching against memorized instances. The frequency of encountering different shift values in training data directly predicts accuracy on those shifts.

Factor 3: Noisy reasoning. After controlling for probability and memorization, genuine reasoning effects remain — but they are noisy. Error rate increases with the number of implicit reasoning steps (shift magnitude). This is real multi-step reasoning, but each step introduces error probability, so accuracy degrades with chain length.
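A simple way to see the degradation (a toy error model of my own, not the paper's fitted one): if each implicit step succeeds independently with probability 1 - ε, and a shift of magnitude k requires k steps, accuracy decays geometrically in k.

```python
def expected_accuracy(shift_magnitude: int, eps: float = 0.1) -> float:
    """Accuracy under per-step noise: each of k steps succeeds w.p. 1 - eps."""
    return (1 - eps) ** shift_magnitude

for k in (1, 5, 13, 25):
    print(k, round(expected_accuracy(k), 3))
```

Even a modest 10% per-step error rate cuts 25-step chains below 10% accuracy, which is why shift magnitude is a usable proxy for reasoning depth.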

The decomposition resolves the ongoing debate about whether LLMs reason or memorize: they do both, simultaneously, and the contribution of each factor varies by task. This supports Does chain-of-thought reasoning reveal genuine inference or pattern matching? while adding a crucial nuance: the apparent pattern matching IS partially genuine reasoning, but contaminated by probability bias and memorization artifacts.

The probability factor is particularly important for understanding CoT faithfulness. Building on Do language models actually use their reasoning steps?, the probability dependence reveals a specific mechanism for causal insufficiency: CoT "reasoning" succeeds partly because the generated token sequence raises the conditional probability of the correct answer, not because the logical content is processed. This is exactly the mechanism behind Does logical validity actually drive chain-of-thought gains? — invalid exemplars still work because they generate token sequences that shift output probability toward correct answers.

The noisy-reasoning factor connects to Does more thinking time always improve reasoning accuracy?: if each reasoning step adds noise, then past some threshold the accumulated noise exceeds the signal, producing the inverted-U performance curve.


