
Where do memorization errors arise in chain-of-thought reasoning?

Explores whether memorization in language model reasoning can be localized to specific token sources and which sources dominate error patterns during long generations.

Note · 2026-02-23 · sourced from Memory

STIM (Source-aware Token-level Identification of Memorization) argues that memorization in long CoT generations must be identified at the token level, not the sequence level. A single faulty token — produced by memorization rather than reasoning — can trigger cascading errors through subsequent steps. Existing metrics report a single score for the entire sequence, missing where and why individual tokens go wrong.
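The contrast between sequence-level and token-level scoring can be made concrete. The sketch below is hypothetical (not the STIM implementation): a single aggregate score over illustrative per-token memorization scores looks benign, while a token-level view exposes the one outlier token that could cascade.

```python
# Hypothetical sketch (not the STIM implementation): a sequence-level
# score averages over tokens, hiding a single highly memorized token
# that may have triggered a cascading error downstream.

def sequence_score(token_scores):
    """One number for the whole generation -- the granularity
    existing metrics report."""
    return sum(token_scores) / len(token_scores)

def flag_tokens(token_scores, threshold=0.8):
    """Token-level view: positions whose memorization score
    exceeds a threshold."""
    return [i for i, s in enumerate(token_scores) if s > threshold]

# Nine plausibly reasoned tokens and one highly memorized token at position 4.
scores = [0.1, 0.2, 0.1, 0.15, 0.95, 0.1, 0.2, 0.1, 0.1, 0.1]

print(sequence_score(scores))  # low aggregate: the sequence looks fine
print(flag_tokens(scores))     # the token-level view isolates position 4
```

The point of the toy numbers: the mean sits near 0.2, so any sequence-level threshold misses the generation entirely, while the token-level view localizes exactly where memorization intervened.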

Three distinct memorization sources influence each token:

  1. Local memorization — frequent continuations of immediately preceding tokens. The model generates the next token based on statistical co-occurrence with its local context, not reasoning. This is the dominant error source, responsible for up to 67% of wrong tokens.

  2. Mid-range memorization — tokens that frequently co-occur with the generation prefix. The model has seen this pattern in pretraining and reproduces it, even when the current reasoning context requires a different continuation.

  3. Long-range memorization — frequent co-occurrence with tokens in the input prompt. The prompt triggers a familiar pattern from pretraining that overrides the reasoning chain.
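The three sources above can be sketched as three co-occurrence scores for one candidate token. This is a minimal, hypothetical illustration (not the STIM implementation): STIM estimates memorization from pretraining-corpus statistics, whereas here a toy corpus of token sequences stands in, and each score is the normalized co-occurrence frequency between the candidate token and one context source.

```python
# Hypothetical sketch of the three source-aware scores (not STIM itself).
# A toy "pretraining corpus" of token sequences stands in for real
# corpus statistics.

corpus = [
    ["the", "answer", "is", "42"],
    ["the", "answer", "is", "42"],
    ["life", "universe", "everything", "42"],
    ["the", "result", "is", "7"],
]

def cooccurrence(token, context_tokens, corpus):
    """Fraction of context-matching sequences that also contain the token."""
    matching = [seq for seq in corpus if any(c in seq for c in context_tokens)]
    if not matching:
        return 0.0
    return sum(1 for seq in matching if token in seq) / len(matching)

def source_scores(token, prompt, prefix, local_window=2):
    """Score one candidate token against the three memorization sources."""
    return {
        "local": cooccurrence(token, prefix[-local_window:], corpus),  # nearby tokens
        "mid": cooccurrence(token, prefix, corpus),                    # whole prefix
        "long": cooccurrence(token, prompt, corpus),                   # input prompt
    }

# "42" frequently follows "answer is" in the toy corpus, so its local
# score is high even though this prompt's correct continuation differs.
print(source_scores("42", prompt=["compute", "6", "times", "7"],
                    prefix=["the", "answer", "is"]))
```

In the toy example, the local and mid-range scores for "42" are high purely because "the answer is 42" is frequent in the corpus: exactly the failure mode where statistical continuation overrides the reasoning the current problem requires.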


This connects to the broader reasoning trace reliability cluster. Building on Which sentences actually steer a reasoning trace?, STIM adds a complementary mechanism: specific tokens at the sub-sentence level carry memorization-driven influence that can derail even well-structured reasoning chains. The failure is more granular than thought-level: it operates at individual tokens.

The practical implication: high memorization scores are strong indicators of reasoning failures (measured via Precision@k and Recall@k). This offers a potential diagnostic for identifying where reasoning chains are unreliable, independent of whether the final answer is correct. The diagnostic also bears on the faithfulness problem raised in Do language models actually use their reasoning steps?: STIM's memorization scores provide a token-level mechanism for faithfulness failures. Memorized tokens are causally unnecessary (the answer was determined by pattern-matching, not reasoning) and causally insufficient (the memorized continuation may diverge from the reasoning the chain appears to perform).
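The Precision@k / Recall@k evaluation can be sketched directly: rank tokens by memorization score and check how well the top-k ranked positions coincide with the tokens that are actually wrong. The scores and error labels below are illustrative, not figures from the paper.

```python
# Hypothetical sketch of memorization scores as an error detector,
# evaluated with Precision@k and Recall@k. Scores and error positions
# are made up for illustration.

def top_k_positions(scores, k):
    """Positions of the k highest-scoring tokens."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def precision_at_k(scores, wrong_positions, k):
    """Of the k most-memorized tokens, what fraction are actually wrong?"""
    top_k = top_k_positions(scores, k)
    return sum(1 for i in top_k if i in wrong_positions) / k

def recall_at_k(scores, wrong_positions, k):
    """Of the wrong tokens, what fraction appear in the top k?"""
    top_k = top_k_positions(scores, k)
    return sum(1 for i in top_k if i in wrong_positions) / len(wrong_positions)

scores = [0.1, 0.9, 0.2, 0.8, 0.3]   # per-token memorization scores
wrong = {1, 3}                        # positions of erroneous tokens

print(precision_at_k(scores, wrong, k=2))  # 1.0: both top tokens are wrong
print(recall_at_k(scores, wrong, k=2))     # 1.0: both wrong tokens retrieved
```

In this toy case the ranking is perfect; the note's claim is the weaker, realistic version: memorization scores rank wrong tokens high often enough to be a useful diagnostic.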




Token-level memorization in CoT reasoning has three distinct sources, and local memorization causes up to 67 percent of reasoning errors.