Where do memorization errors arise in chain-of-thought reasoning?
Explores whether memorization in language model reasoning can be localized to specific token sources and which sources dominate error patterns during long generations.
STIM (Source-aware Token-level Identification of Memorization) argues that memorization in long CoT generations must be identified at the token level, not the sequence level. A single faulty token — produced by memorization rather than reasoning — can trigger cascading errors through subsequent steps. Existing metrics report a single score for the entire sequence, missing where and why individual tokens go wrong.
Three distinct memorization sources influence each token:
Local memorization — frequent continuations of immediately preceding tokens. The model generates the next token based on statistical co-occurrence with its local context, not reasoning. This is the dominant error source, responsible for up to 67% of wrong tokens.
Mid-range memorization — tokens that frequently co-occur with the generation prefix. The model has seen this pattern in pretraining and reproduces it, even when the current reasoning context requires a different continuation.
Long-range memorization — frequent co-occurrence with tokens in the input prompt. The prompt triggers a familiar pattern from pretraining that overrides the reasoning chain.
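A rough way to picture the three sources (not STIM's published algorithm): treat each source as a conditional co-occurrence frequency looked up in the pretraining corpus, and keep the strongest source per token. The sketch below assumes a hypothetical `count` oracle (for example, an n-gram index over the pretraining data) and normalizes each source to a conditional frequency in [0, 1]; all names and windows here are illustrative assumptions.

```python
from typing import Callable, List


def stim_style_scores(
    prompt: List[str],
    generation: List[str],
    count: Callable[[List[str]], int],  # hypothetical corpus-count oracle (e.g. an n-gram index)
    local_window: int = 4,
) -> List[dict]:
    """For each generated token, score how strongly three context sources predict it
    from raw corpus co-occurrence rather than reasoning. Illustrative sketch only."""
    scores = []
    for i, tok in enumerate(generation):
        # Local: how often does `tok` follow the last few generated tokens?
        local_ctx = generation[max(0, i - local_window):i]
        local = _cond_freq(local_ctx, tok, count)

        # Mid-range: strongest co-occurrence of `tok` with any token in the generation prefix.
        mid = max((_cond_freq([w], tok, count) for w in generation[:i]), default=0.0)

        # Long-range: strongest co-occurrence of `tok` with any token in the input prompt.
        longr = max((_cond_freq([w], tok, count) for w in prompt), default=0.0)

        dominant = max(("local", local), ("mid", mid), ("long", longr), key=lambda p: p[1])[0]
        scores.append({"token": tok, "local": local, "mid": mid, "long": longr,
                       "max_source": dominant})
    return scores


def _cond_freq(context: List[str], tok: str, count) -> float:
    """Corpus estimate of P(tok | context); 0 if the context is empty or unseen."""
    if not context:
        return 0.0
    ctx_count = count(context)
    return count(context + [tok]) / ctx_count if ctx_count else 0.0
```

The point of the source labels is diagnostic: a token whose strongest predictor is its local window was most likely produced by co-occurrence statistics rather than by the current reasoning state.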
Key distributional findings:
- Complexity increases memorization. As reasoning complexity increases, models rely more on memorization — they fall back on familiar patterns when the reasoning becomes harder.
- Distributional shift increases memorization. Moving toward rare or atypical inputs strengthens memorization signals. The model has less training experience to draw on, so it relies more on pattern-matching from similar-but-not-identical training examples.
- Base vs long-tail reversal. In base settings, memorization often supports correct answers (familiar patterns lead to right conclusions). In long-tail scenarios, the same memorization mechanisms drive errors — defective recall when faced with unfamiliar contexts.
This connects to the broader reasoning-trace reliability cluster. Relative to Which sentences actually steer a reasoning trace?, STIM adds a complementary mechanism: specific tokens at the sub-sentence level carry memorization-driven influence that can derail even well-structured reasoning chains. The failure is more granular than thought-level: it operates at individual tokens.
The practical implication: high memorization scores are strong indicators of reasoning failures (measured via Precision@k and Recall@k). This offers a potential diagnostic for locating unreliable spans in a reasoning chain, independent of whether the final answer is correct. The diagnostic also bears on the faithfulness question raised in Do language models actually use their reasoning steps?: STIM's memorization scores supply a token-level mechanism for faithfulness failure. Memorized tokens are causally unnecessary (the answer was determined by pattern-matching, not reasoning) and causally insufficient (the memorized continuation may diverge from the reasoning the chain appears to perform).
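To make the diagnostic concrete, here is a minimal sketch of Precision@k and Recall@k over tokens ranked by memorization score. The function name, inputs, and ranking choice are assumptions for illustration, not the paper's evaluation code.

```python
from typing import List, Sequence, Tuple


def precision_recall_at_k(mem_scores: Sequence[float],
                          wrong_token_idxs: List[int],
                          k: int) -> Tuple[float, float]:
    """Rank tokens by memorization score (descending) and check how many of the
    top-k coincide with tokens annotated as wrong. Illustrative metric only."""
    ranked = sorted(range(len(mem_scores)), key=lambda i: mem_scores[i], reverse=True)
    top_k = set(ranked[:k])
    hits = len(top_k & set(wrong_token_idxs))
    precision = hits / k if k else 0.0
    recall = hits / len(wrong_token_idxs) if wrong_token_idxs else 0.0
    return precision, recall
```

Read precision@k as "how many of the k most-memorized tokens are actually wrong" and recall@k as "what fraction of the wrong tokens the top-k captures."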
Source: Memory
Related concepts in this collection
- Which sentences actually steer a reasoning trace? Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking. Relation: complementary granularity; thought anchors operate at the sentence level, STIM at the token level.
- Do only 20 percent of tokens actually matter for reasoning? Chain-of-thought reasoning might depend on a small minority of high-entropy tokens that act as decision points. If true, could training that focuses only on these critical tokens match or exceed full-gradient updates? Relation: both identify sparse tokens with disproportionate influence; STIM adds the memorization-source dimension.
- Do reasoning traces need to be semantically correct? Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning. Relation: corrupted traces may work precisely because they break local memorization patterns, forcing the model into generalization mode.
- Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly. Relation: local memorization provides the mechanism; the model reproduces familiar reasoning patterns rather than deriving new ones.
- Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations. Relation: STIM provides the token-level mechanism; memorized tokens are neither causally sufficient nor necessary for reasoning.
Original note title: token-level memorization in CoT reasoning has three distinct sources and local memorization causes up to 67 percent of reasoning errors