Thought Anchors: Which LLM Reasoning Steps Matter?

Paper · arXiv 2506.19143 · Published June 23, 2025

We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method measuring each sentence’s counterfactual importance by comparing final answers across 100 rollouts conditioned on the model generating that sentence or one with a different meaning; (2) a white-box method of aggregating attention patterns between pairs of sentences, which identifies “broadcasting” sentences that receive disproportionate attention from all future sentences via “receiver” attention heads; (3) a causal attribution method measuring logical connections between sentences by suppressing attention toward one sentence and measuring the effect on each future sentence’s tokens. Each method provides evidence for the existence of thought anchors, reasoning steps that have outsized importance and that disproportionately influence the subsequent reasoning process. These thought anchors are typically planning or backtracking sentences. We provide an open-source tool (thought-anchors.com) for visualizing the outputs of our methods, and present a case study showing converging patterns across methods that map how a model performs multi-step reasoning.

Mechanistic interpretability methods [25, 24] often focus on a single forward pass of the model: understanding the activations, how they are processed by each layer, and how they are converted into the final output.

A natural decomposition for chain-of-thought is into individual sentences and how they depend on each other. Interpretations of neural network behavior operate at varying levels of abstraction [11, 12], and sentence-level explanations strike an intermediate abstraction depth. Compared to tokens, sentences are more coherent and often coincide with reasoning steps extracted by an LLM [31, 2]. Compared to paragraphs, sentences are less likely to conflate reasoning steps and may serve as an effective target for linking different steps.

Prior work has established that different sentences within reasoning traces perform distinct functions. Backtracking sentences (e.g., “Wait...”) cause the model to revisit earlier conclusions, which boosts final-answer accuracy [22]. Other research has distinguished sentences based on whether they retrieve new information or execute deduction with existing information [31]. Hence, reasoning may follow an overarching structure, where sentences can introduce and pursue high-level computational goals.

We propose three complementary methods for mapping the structure of reasoning traces that focus on what we term thought anchors: critical reasoning steps that guide the rest of the reasoning trace. We provide evidence for this type of anchoring based on black-box evidence from resampling and white-box evidence based on attention patterns.

First, in section 3 we present a black-box method for measuring the counterfactual impact of a sentence on the model’s final answer and future sentences. We repeatedly resample reasoning traces from the start of each sentence. Based on the resampling data, we can quantify each sentence’s impact on the likelihood of any final answer or of producing any subsequent sentence. This resampling approach additionally lets us distinguish planning sentences, which initiate computations leading to some answer, from sentences that perform computations necessary for the answer but are themselves predetermined.
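In outline, the counterfactual measure can be sketched as below. The helpers `generate_rollout(prefix)` (returning a resampled sentence and the resulting final answer) and `same_meaning(a, b)` (a semantic-similarity check) are illustrative placeholders, not our released code, and the rollout budget mirrors the roughly 100 rollouts per sentence used in the paper.

```python
# Minimal sketch of the black-box resampling measure.
# `generate_rollout` and `same_meaning` are hypothetical helpers.
import numpy as np

def counterfactual_importance(sentences, target_answer,
                              generate_rollout, same_meaning,
                              n_rollouts=100):
    """For each sentence, compare P(target answer) between rollouts
    whose resampled sentence keeps vs. changes the original meaning."""
    scores = []
    for i, sent in enumerate(sentences):
        prefix = " ".join(sentences[:i])  # resample from start of sentence i
        kept, changed = [], []
        for _ in range(n_rollouts):
            new_sent, answer = generate_rollout(prefix)
            hit = float(answer == target_answer)
            (kept if same_meaning(new_sent, sent) else changed).append(hit)
        # Importance: shift in answer likelihood when the resampled
        # sentence means something different from the original.
        if kept and changed:
            scores.append(np.mean(kept) - np.mean(changed))
        else:
            scores.append(float("nan"))  # not enough rollouts of one kind
    return scores
```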

Second, in section 4 we present a white-box method for evaluating importance based on which sentences receive the most attention. Our analyses reveal “receiver” heads that narrow attention toward particular past “broadcasting” sentences. Compared to base models, where attention is more diffuse, reasoning models display greater overall attentional narrowing through receiver heads, and these heads have an outsized impact on the model’s final answer. We develop a systematic approach to identifying receiver heads and show that scoring sentences by the extent to which these heads attend to them provides a mechanistic measure of importance.
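The following simplified sketch illustrates one way such a head score can be computed: token-level attention weights are averaged into a sentence-by-sentence matrix, and a head is scored by how sharply the attention each sentence receives is concentrated. The kurtosis statistic and the `sentence_spans` representation here are illustrative assumptions; a head with a high score attends narrowly to a few past sentences and is a receiver-head candidate.

```python
# Hedged sketch of scoring one attention head for "receiver" behavior.
import numpy as np
from scipy.stats import kurtosis

def sentence_attention(attn, sentence_spans):
    """Average a (query_tokens x key_tokens) attention map into a
    (num_sentences x num_sentences) sentence-level matrix."""
    n = len(sentence_spans)
    M = np.zeros((n, n))
    for qi, (qs, qe) in enumerate(sentence_spans):
        for ki, (ks, ke) in enumerate(sentence_spans):
            M[qi, ki] = attn[qs:qe, ks:ke].mean()
    return M

def receiver_score(attn, sentence_spans):
    """High kurtosis = attention received is concentrated on a few
    sentences, i.e., the head narrows onto 'broadcasting' sentences."""
    M = sentence_attention(attn, sentence_spans)
    received = M.mean(axis=0)  # avg attention each sentence receives
    return kurtosis(received)
```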

Third and finally, in section 5 we present a method that measures the causal dependency between specific pairs of sentences in a reasoning trace. For each sentence in a trace, we intervene by masking all attention to it from subsequent tokens. We then measure the effect of this suppression on subsequent token logits (via KL divergence) relative to those generated without suppression. Averaging token-level effects by sentence, this strategy measures each sentence’s direct causal effect on each subsequent sentence.
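A simplified sketch of this intervention appears below. The helper `run_with_suppressed_span` is hypothetical, standing in for model-specific code (e.g., forward hooks on each attention module) that masks attention toward the target sentence’s tokens; the model is assumed to expose HuggingFace-style `.logits`.

```python
# Hedged sketch of the attention-suppression measure for one sentence.
import torch
import torch.nn.functional as F

def suppression_effects(model, input_ids, sentence_spans, i,
                        run_with_suppressed_span):
    """KL divergence, per subsequent sentence, between the model's
    token distributions with and without attention to sentence i."""
    with torch.no_grad():
        base = F.log_softmax(model(input_ids).logits, dim=-1)
        supp_logits = run_with_suppressed_span(  # hypothetical helper
            model, input_ids, sentence_spans[i])
        supp = F.log_softmax(supp_logits, dim=-1)
    # KL(suppressed || baseline) at each token position.
    kl = (supp.exp() * (supp - base)).sum(dim=-1).squeeze(0)
    # Average token-level effects within each subsequent sentence.
    return [kl[s:e].mean().item() for (s, e) in sentence_spans[i + 1:]]
```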


This can be seen as a tentative CoT circuit: two conclusions conflict to produce a discrepancy, which in turn prompts the model to resolve that discrepancy.

This work presents initial steps towards a principled decomposition of reasoning traces, with a focus on identifying thought anchors: sentences that exert outsized influence on the model’s final response, on specific future sentences, and on the downstream reasoning trajectory. We have also begun unpacking the attentional mechanisms associated with these important sentences. We expect that understanding thought anchors will be critical for interpreting reasoning models and ensuring their safety.

Our analyses require refinement to fully grapple with how downstream sentences may be overdetermined, whether by different trajectories within a reasoning trace or by independent sufficient causes.