Do shorter correct reasoning traces contain more thought anchors than longer ones?
The question is whether the handful of sentences that genuinely steer a reasoning trace are concentrated more tightly in short correct answers than in sprawling ones. No note measures this ratio head-on, but two findings, read against each other, point somewhere counterintuitive. First, in o1-style models (QwQ, DeepSeek-R1, LIMO), correct traces are on average *shorter* than incorrect ones, and the reason isn't economy for its own sake: longer traces accumulate self-revisions, and those revisions introduce and compound errors rather than repair them (Why do correct reasoning traces contain fewer tokens?). Second, the sentences that causally guide a trace, its 'thought anchors', are specifically *planning and backtracking* sentences, identified by three methods that agree with one another: counterfactual resampling, attention analysis, and causal suppression (Which sentences actually steer a reasoning trace?).
Here's the tension worth noticing: backtracking is itself an anchor category. So a long trace, full of self-revision, is in one sense full of anchor-type sentences — but they're the destructive kind. A short correct trace gets where it's going by hitting its planning pivots and committing, with little backtracking to walk things back. That suggests the honest answer is split by *which* anchor you mean. Short correct traces likely have a higher density of load-bearing *planning* anchors relative to filler — they're mostly pivot, little padding. Long traces have more *total* anchors, but the extra ones are backtracking moves that the shorter-is-better result tells us are doing harm, not work.
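The split between anchor *density* and raw anchor *count* can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the per-sentence labels ("plan", "backtrack", "other"), the `anchor_stats` helper, and the two toy traces are invented, not data from any of the notes.

```python
from collections import Counter

# Assumption: the two anchor categories named in the text.
ANCHOR_LABELS = {"plan", "backtrack"}


def anchor_stats(labels):
    """Return raw anchor count and per-category density for one trace.

    `labels` holds one label per sentence: "plan", "backtrack", or "other"
    (a hypothetical labeling scheme, assumed to be assigned beforehand).
    """
    total = len(labels)
    if total == 0:
        raise ValueError("trace has no sentences")
    counts = Counter(labels)
    return {
        "sentences": total,
        "anchor_count": sum(counts[k] for k in ANCHOR_LABELS),
        "plan_density": counts["plan"] / total,
        "backtrack_density": counts["backtrack"] / total,
    }


# Toy traces, invented for illustration only (not measured data).
short_trace = ["plan", "other", "plan", "other"]
long_trace = ["plan", "other", "backtrack", "other", "backtrack",
              "other", "plan", "backtrack", "other", "other"]

short_stats = anchor_stats(short_trace)
long_stats = anchor_stats(long_trace)
```

In this toy case the long trace has more anchors in total (5 versus 2) while the short trace has the higher planning density (0.5 versus 0.2), which is the shape of the split answer argued above. Whether real traces look like this is exactly what no note in the corpus measures.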
The broader corpus complicates even this. Length itself is a slippery proxy: trace length tracks how close a problem sits to the training distribution, not how hard it is or how much computation it deserves (Does longer reasoning actually mean harder problems?), and accuracy follows an inverted-U in which models overthink easy problems past a token threshold (Does more thinking time always improve reasoning accuracy?; Why does chain of thought accuracy eventually decline with length?). So 'longer' often means 'further from familiar ground and flailing,' which is exactly where extra backtracking anchors would proliferate without helping.
The deepest wrinkle is whether anchors are real reasoning at all. A parallel line of work argues traces are stylistic mimicry: corrupted or logically invalid traces perform nearly as well as clean ones, implying the tokens are computational scaffolding, not functional inference (Do reasoning traces need to be semantically correct?; Do reasoning traces actually cause correct answers?; Do reasoning traces show how models actually think?). If that's right, 'thought anchor' names a position where resampling changes the output, not a place where the model 'decides' something. The thought-anchors finding insists these pivots are functional, not noise, so the unresolved question underneath yours is whether anchor density measures concentrated *reasoning* or just concentrated *formatting leverage*.
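The operational reading of an anchor, a position where resampling changes the output, can be sketched in a few lines. This is a simplified stand-in, not the procedure used in the thought-anchors note: `sample_answer` is a hypothetical callable standing in for a model that completes a trace prefix and returns a final answer, and `toy_sampler` is an invented deterministic-ish stub so the example runs without a model.

```python
import random
from collections import Counter


def answer_distribution(sample_answer, prefix, n, rng):
    """Empirical distribution over final answers from n sampled completions."""
    counts = Counter(sample_answer(prefix, rng) for _ in range(n))
    return {answer: c / n for answer, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def resampling_importance(sentences, i, sample_answer, n=200, seed=0):
    """Shift in the answer distribution from keeping sentence i versus
    continuing from the prefix that stops just before it."""
    rng = random.Random(seed)
    with_i = answer_distribution(sample_answer, sentences[: i + 1], n, rng)
    without_i = answer_distribution(sample_answer, sentences[:i], n, rng)
    return total_variation(with_i, without_i)


def toy_sampler(prefix, rng):
    """Hypothetical model stub: once a plan sentence is in the prefix the
    answer is fixed to "A"; otherwise it is a coin flip between "A" and "B"."""
    if any(s.startswith("plan:") for s in prefix):
        return "A"
    return rng.choice(["A", "B"])


trace = ["Restate the problem.", "plan: try induction.", "Expand the base case."]
plan_score = resampling_importance(trace, 1, toy_sampler)
filler_score = resampling_importance(trace, 2, toy_sampler)
```

In the stub, the plan sentence gets a large score and the filler sentence after it gets zero. Note what the score does and does not say: it measures leverage over the output, which is compatible with either reading above, concentrated reasoning or concentrated formatting leverage.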
So: probably yes for planning anchors as a fraction of the trace, probably no for raw anchor count. The reason short traces win may be less that they're denser in good pivots than that they're starved of the bad ones. If you want to chase the mechanism behind why long traces decay, the memorization-source breakdown is the doorway: local memorization, driven by the immediately preceding tokens, accounts for up to 67% of reasoning errors and worsens as complexity and distribution shift increase (Where do memorization errors arise in chain-of-thought reasoning?).
Sources (9 notes)
Across QwQ, DeepSeek-R1, and LIMO, correct solutions average fewer tokens than incorrect ones. Longer traces correlate with more self-revisions, which introduce and compound errors rather than improve reasoning quality.
Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, which is based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.