Which sentences actually steer a reasoning trace?
Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
Mechanistic interpretability of reasoning traces typically focuses on token-level activations. The "Thought Anchors" paper takes a sentence-level approach, arguing that sentences are a more coherent unit for understanding reasoning than tokens but more granular than paragraphs.
Three complementary methods are applied to the same reasoning traces:
Counterfactual resampling (black-box): For each sentence, resample 100 completions with that sentence kept versus replaced by a sentence with a different meaning. Sentences that significantly shift the final-answer distribution have high counterfactual importance (see the sketches after this list).
Attention pattern analysis (white-box): Identify "receiver heads" — attention heads that narrow focus toward specific past sentences. Sentences that receive concentrated attention from receiver heads are mechanistically central to downstream computation.
Causal suppression (white-box): Mask attention toward each sentence from subsequent tokens, then measure the KL divergence of the subsequent token distributions from the unsuppressed baseline. Sentences whose suppression has large downstream effects are causally active.
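To make the resampling probe concrete, here is a minimal sketch assuming a HuggingFace causal LM (the model name is a placeholder) and a task whose final answers appear in \boxed{...}; `extract_answer` is a hypothetical stand-in you would replace per task. It scores a sentence by the total-variation shift it induces in the final-answer distribution:

```python
from collections import Counter
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any open reasoning model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def extract_answer(text: str) -> str:
    """Task-specific answer extraction; a \\boxed{...} regex as a stand-in."""
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    return m.group(1) if m else "NO_ANSWER"

def answer_distribution(prefix: str, n: int = 100) -> Counter:
    """Sample n completions of `prefix` and tally the extracted final answers."""
    ids = tok(prefix, return_tensors="pt").to(model.device)
    out = model.generate(
        **ids, do_sample=True, temperature=0.7,
        max_new_tokens=512, num_return_sequences=n,
    )
    new_tokens = out[:, ids["input_ids"].shape[1]:]
    texts = tok.batch_decode(new_tokens, skip_special_tokens=True)
    return Counter(extract_answer(t) for t in texts)

def counterfactual_importance(prompt: str, sents: list[str], i: int, swap: str) -> float:
    """Total-variation distance between answer distributions with sentence i
    kept versus replaced by a different-meaning sentence `swap`."""
    p = answer_distribution(prompt + " ".join(sents[: i + 1]))
    q = answer_distribution(prompt + " ".join(sents[:i] + [swap]))
    n_p, n_q = sum(p.values()), sum(q.values())
    return 0.5 * sum(abs(p[a] / n_p - q[a] / n_q) for a in set(p) | set(q))
```

The suppression probe can be approximated in the same setup by zeroing the attention_mask over a sentence's token span, which hides those positions as attention keys. This is a simplification of masking attention only from subsequent tokens, so treat it as a rough proxy rather than the paper's exact intervention:

```python
import torch.nn.functional as F

def suppression_effect(text: str, span: tuple[int, int]) -> float:
    """Approximate causal suppression of one sentence occupying token span
    [start, end). Zeroing attention_mask hides those positions as keys
    (a simplification: it hides them from all positions, not only later
    ones). Returns the mean KL of ablated next-token distributions from
    the baseline over positions after the suppressed sentence."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        base = model(**ids).logits.log_softmax(-1)
        mask = ids["attention_mask"].clone()
        mask[0, span[0]:span[1]] = 0
        abl = model(input_ids=ids["input_ids"],
                    attention_mask=mask).logits.log_softmax(-1)
    return F.kl_div(abl[0, span[1]:], base[0, span[1]:],
                    log_target=True, reduction="batchmean").item()
```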
All three methods converge on the same subset of sentences: planning sentences (establishing the direction of reasoning) and backtracking sentences ("Wait...", "Actually...", error-correction steps). These are the thought anchors — sentences that disproportionately guide what comes after.
The finding that backtracking sentences are thought anchors extends "Why do correct reasoning traces contain fewer tokens?" and "Do hedging markers actually signal careful thinking in AI?". Backtracking is not mere noise — it is a functional pivot. A backtracking sentence recognized as a thought anchor shifts the entire subsequent reasoning trajectory.
This also reveals why receiver heads in reasoning models are more narrowly focused than in base models: the reasoning-trained model has learned to weight certain past sentences more heavily as guides for subsequent generation. This attentional specialization is the mechanistic signature of structured reasoning.
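One way to quantify that narrowing, sketched under the assumption that you have per-layer attention maps (e.g., from a forward pass with output_attentions=True) and precomputed token spans for each sentence: score each head by how much of its sentence-level attention mass funnels into a single past sentence, then compare score distributions between a base and a reasoning-tuned model.

```python
import torch

def sentence_attention(attn: torch.Tensor, spans: list[tuple[int, int]]) -> torch.Tensor:
    """Collapse a (heads, seq, seq) attention map to (heads, S, S) by
    averaging over the token positions inside each sentence span."""
    S = len(spans)
    out = torch.zeros(attn.shape[0], S, S)
    for i, (qs, qe) in enumerate(spans):
        for j, (ks, ke) in enumerate(spans):
            out[:, i, j] = attn[:, qs:qe, ks:ke].mean(dim=(1, 2))
    return out

def receiver_score(attn: torch.Tensor, spans: list[tuple[int, int]]) -> torch.Tensor:
    """Per-head concentration: the fraction of each head's total incoming
    sentence-level attention that lands on its single most-attended
    sentence. Higher values suggest 'receiver' behavior."""
    sent = sentence_attention(attn, spans)
    received = sent.sum(dim=1)  # (heads, S): mass each sentence receives
    return received.max(dim=-1).values / received.sum(dim=-1).clamp_min(1e-9)
```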
Practical implication: if you want to evaluate whether a reasoning trace is doing real work, identify the thought anchors. If you want to steer reasoning, these are the leverage points. The anchors are not uniformly distributed — sparse critical sentences dominate.
Information-theoretic confirmation (MI Peaks): The "Demystifying Reasoning Dynamics with Mutual Information" paper provides a fourth convergent method. By tracking mutual information (MI) between intermediate representations and the correct answer across reasoning steps, it finds MI peaks — positions where information about the correct answer suddenly spikes. These peaks are sparse and non-uniformly distributed. Crucially, MI peaks correspond to the same class of tokens identified as thought anchors: reflection tokens ("Wait," "Hmm"), transition tokens ("Therefore," "So"), and self-correction tokens. Suppressing these thinking tokens significantly degrades reasoning performance, while suppressing the same number of random tokens has minimal impact. The paper also proposes Representation Recycling (RR) — allowing representations at MI peaks to undergo multiple iterations through the model — which improves accuracy by up to 20% on hard benchmarks. This is the first technique that directly exploits thought-anchor identification for performance improvement. See "Do reflection tokens carry more information about correct answers?".
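A minimal sketch of the peak-finding idea, assuming per-step hidden states collected from many traces and integer-encoded answer labels. A logistic probe's cross-entropy gives a variational (Barber-Agakov) lower bound on MI; this is one standard estimator, not necessarily the paper's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def mi_lower_bound_per_step(H: np.ndarray, y: np.ndarray) -> np.ndarray:
    """H: (n_traces, n_steps, d) hidden states; y: (n_traces,) answers
    encoded as small non-negative ints. Returns a per-step MI lower
    bound in nats: I(h_t; y) >= H(y) - CE(probe).
    (Probe is trained and scored on the same traces for brevity;
    use a held-out split in practice.)"""
    n, T, _ = H.shape
    counts = np.bincount(y) / n
    h_y = -(counts[counts > 0] * np.log(counts[counts > 0])).sum()
    mi = np.zeros(T)
    for t in range(T):
        probe = LogisticRegression(max_iter=1000).fit(H[:, t, :], y)
        ce = log_loss(y, probe.predict_proba(H[:, t, :]), labels=probe.classes_)
        mi[t] = max(h_y - ce, 0.0)
    return mi

def peak_steps(mi: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Steps where MI exceeds mean + k*std: sparse candidate anchors."""
    return np.where(mi > mi.mean() + k * mi.std())[0]
```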
Token-level memorization sources (STIM, 2508.02037): The STIM framework adds a fifth convergent method, this one at the token level — identifying three distinct sources of memorization that cause reasoning errors: (1) local memorization from frequent continuations of immediately preceding tokens (the dominant error source, up to 67% of wrong tokens), (2) mid-range memorization from co-occurrence with the generation prefix, and (3) long-range memorization from co-occurrence with prompt tokens. Under distributional shift toward rare inputs, all three sources intensify. High STIM memorization scores predict erroneous tokens with high Precision@k and Recall@k. This adds a complementary mechanism to the thought anchor framework: while thought anchors identify which sentences are structurally important (planning/backtracking), STIM identifies which tokens within those sentences are driven by memorization rather than reasoning. A thought anchor sentence could contain tokens that are mechanistically pivotal AND memorization-driven — explaining why structurally important reasoning steps can nevertheless produce errors. See "Where do memorization errors arise in chain-of-thought reasoning?".
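The core intuition behind the local score can be sketched with a toy n-gram table; the real framework computes corpus-scale statistics and a more involved score, so everything below is illustrative:

```python
from collections import defaultdict

# Toy corpus n-gram table: context tuple -> {continuation token: count}.
# The real framework derives these statistics from the pretraining corpus.
ngram_counts: dict[tuple[str, ...], dict[str, int]] = defaultdict(dict)
# e.g. ngram_counts[("the", "square", "root")]["of"] = 90_210

def local_memorization_score(prev_tokens: list[str], token: str,
                             n: int = 3, alpha: float = 1.0,
                             vocab_size: int = 50_000) -> float:
    """Smoothed corpus probability of `token` continuing the last n tokens.
    `vocab_size` is an assumed vocabulary size for add-alpha smoothing.
    High scores on wrong tokens suggest memorization-driven errors."""
    context = tuple(prev_tokens[-n:])
    cont = ngram_counts.get(context, {})
    return (cont.get(token, 0) + alpha) / (sum(cont.values()) + alpha * vocab_size)
```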
Token-level mechanistic refinement: The "Beyond 80/20" RLVR analysis provides a finer-grained version of the same insight at the token level. High-entropy minority tokens — the ~20% of tokens where the model's probability distribution is most uncertain — are the critical forking points where RLVR's gradient signal is concentrated. Restricting gradient updates to only these tokens matches or exceeds full-gradient updates. These high-entropy tokens are the token-level analog of sentence-level thought anchors: both identify sparse critical junctures where the reasoning trajectory can diverge. The convergence across levels of analysis (tokens, sentences) reinforces that reasoning traces have a sparse-pivot structure at multiple granularities. See "Do only 20 percent of tokens actually matter for reasoning?".
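A sketch of the token-selection step, assuming per-token logits from a rollout; the resulting mask would gate each token's contribution to the RLVR loss:

```python
import torch

def forking_token_mask(logits: torch.Tensor, keep_frac: float = 0.2) -> torch.Tensor:
    """logits: (seq_len, vocab). Returns a 0/1 mask over positions,
    keeping the top `keep_frac` fraction by predictive entropy
    (the high-entropy 'forking' tokens)."""
    logp = logits.log_softmax(-1)
    entropy = -(logp.exp() * logp).sum(-1)          # (seq_len,)
    k = max(1, int(keep_frac * entropy.numel()))
    thresh = entropy.topk(k).values.min()           # k-th largest entropy
    return (entropy >= thresh).float()

# usage: weight each token's policy-gradient term by the mask, e.g.
#   mask = forking_token_mask(logits)
#   loss = (per_token_pg_loss * mask).sum() / mask.sum().clamp_min(1.0)
```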
Source: Reasoning Methods (CoT, ToT, RLVR, Memory)
Related concepts in this collection
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  thought anchors are the steps where causal necessity can be tested directly: suppress the anchor, measure the effect
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
  thought anchors may explain why shorter traces are more accurate: fewer non-anchor steps mean higher anchor density; less noise around the critical pivots
- Do hedging markers actually signal careful thinking in AI?
  Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty. This matters because users often interpret such language as a sign of trustworthy reasoning.
  backtracking sentences are a class of hedging; the thought anchor finding clarifies their function: they are pivots, not mere markers of uncertainty
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  thought anchor analysis offers a path toward verifying traces: mechanistic anchor identification does not rely on the model's self-report
- Do only 20 percent of tokens actually matter for reasoning?
  Chain-of-thought reasoning might depend on a small minority of high-entropy tokens that act as decision points. If true, could training that focuses only on these critical tokens match or exceed full-gradient updates?
  token-level analog: high-entropy forking tokens are the sub-sentence version of thought anchors
- Can models learn to plan without changing their architecture?
  Explores whether embedding future information directly into training data can teach language models to plan and reason about goals, without modifying the underlying neural architecture or training algorithms.
  thought anchors (especially planning sentences) may be the behavioral manifestation of goal conditioning: the model self-generates planning sentences that function as lookahead tokens, conditioning subsequent generation on anticipated goals; TRELAWNEY trains this capacity explicitly
- Does failed-step fraction predict reasoning quality better?
  Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
  the negative counterpart to thought anchors: FSF measures how much failed exploration contaminates the context, while thought anchors identify the successful pivot points — together they define the structural quality of a reasoning trace
- Do reasoning cycles in hidden states reveal aha moments?
  What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.
  hidden-state topology confirms at the representation level what thought anchors identify at the sentence level: backtracking sentences create the cycles in reasoning graphs, planning sentences extend diameter; the convergence across granularities (token, sentence, hidden-state graph) reinforces the sparse-pivot structure of reasoning
- What mechanism enables models to retrieve from long context?
  Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?
  retrieval heads are the mechanistic substrate enabling attention to thought anchors during CoT: the sparse <5% of attention heads that retrieve information from earlier context are what allows planning and backtracking sentences to exert downstream causal influence
- How do language models perform syllogistic reasoning internally?
  Does formal symbolic reasoning exist as a distinct neural circuit in LLMs, or is it inevitably contaminated by world knowledge associations? Understanding the mechanism could reveal whether pure logical reasoning is separable from semantic inference.
  both findings demonstrate that reasoning has a sparse mechanistic structure: syllogistic circuits identify a three-stage process where specific attention heads perform suppression and mediation, while thought anchors identify the sentence-level pivots where those circuits concentrate their influence; the recitation stage (attending to premise information) is mechanistically enabled by the same attentional selectivity that makes some sentences into anchors
- Can intermediate reasoning points yield better answers than final ones?
  When reasoning models commit to a single path, they may miss better conclusions available at earlier decision points. Can aggregating completions from intermediate reasoning states recover lost accuracy?
  practical exploitation of thought anchor locations: subthought aggregation branches from transition points in the trace (where thought anchors cluster) and recovers answers that are 13% more accurate than the final answer; thought anchors explain WHY these branching points are productive — they are the causal pivot points where path commitment has the most downstream consequence
Original note title: thought anchors are planning and backtracking sentences with disproportionate causal influence on reasoning traces