What evidence shows that reasoning chains encode token-level functional structure?

This explores what we actually know about whether some tokens inside a reasoning chain do real computational work while others are filler — and the evidence that models treat them differently rather than uniformly.

The question is whether reasoning chains have an internal functional skeleton, with tokens that carry the computation set apart from tokens that are just connective tissue, and what evidence supports that picture. The most direct finding comes from a study that greedily prunes reasoning chains while preserving the model's own likelihood. Six distinct functional categories of tokens fall out, and the model preferentially protects the ones doing symbolic computation while shedding grammar and meta-discourse first (Which tokens in reasoning chains actually matter most?). Strikingly, student models trained on these functionally pruned chains outperform those trained on chains compressed by a frontier model, so the functional ranking isn't a curiosity: it's information you can train on.
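To make the pruning idea concrete, here is a minimal sketch of greedy, score-preserving pruning. It is not the study's implementation: `toy_score` is a hypothetical stand-in for the model's own likelihood (it just gives numbers and operators more weight than prose), and `greedy_prune`, `keep`, and the example chain are illustrative names and values, not anything from the source.

```python
from typing import Callable, List, Tuple

Score = Callable[[List[str]], int]


def toy_score(chain: List[str]) -> int:
    """Hypothetical stand-in for a model's likelihood of the answer given the chain.

    Assumption for illustration only: numbers and operators contribute far
    more to the score than ordinary words do.
    """
    total = 0
    for tok in chain:
        if tok.isdigit():
            total += 30
        elif tok in {"+", "-", "*", "/", "="}:
            total += 20
        else:
            total += 1
    return total


def greedy_prune(tokens: List[str], score: Score, keep: int) -> Tuple[List[str], List[str]]:
    """Drop one token at a time, always the one whose removal leaves the highest score."""
    kept = list(tokens)
    removed: List[str] = []
    while len(kept) > keep:
        # Score every "chain minus one token" candidate and pick the least costly removal.
        cheapest = max(range(len(kept)), key=lambda i: score(kept[:i] + kept[i + 1:]))
        removed.append(kept.pop(cheapest))
    return kept, removed


chain = ["So", "we", "compute", "3", "*", "4", "=", "12", "obviously"]
kept, removed = greedy_prune(chain, toy_score, keep=5)
print(kept)     # tokens that survive pruning
print(removed)  # tokens in the order they were shed
```

The removal order is the useful by-product: tokens shed early are the ones the scorer can most easily do without, which is the kind of ranking the functional categories are read off from. With a real model, `score` would be a likelihood call per candidate, so each pruning step costs one evaluation per remaining token.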

That result lands differently depending on what else you believe about chain-of-thought. A large strand of the corpus argues CoT isn't genuine inference at all: it's constrained imitation of reasoning *form*, where format and spatial layout matter far more than logical content, invalid prompts work nearly as well as valid ones, and performance degrades exactly the way imitation (not capability) would predict (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; What makes chain-of-thought reasoning actually work? [two notes]). Even deliberately corrupted traces teach about as well as correct ones (Do reasoning traces need to be semantically correct?), and traces read more as persuasive appearances than verified computation (Do reasoning traces show how models actually think?). Hold these together and a sharper claim emerges: the chain is scaffolding, but *not all of the scaffold is load-bearing.* Functional-importance ranking and the corrupted-trace results agree more than they seem to: both say semantic prose is dispensable while structural and computational positions carry the weight.

The mechanistic evidence makes the "some tokens do the work" story concrete. Logit-lens analysis of models trained with hidden CoT tokens shows them computing the correct answer in layers 1–3 and then actively suppressing it in the final layers to emit format-compliant filler; the real reasoning stays recoverable from lower-ranked predictions, hidden beneath the surface tokens (Do transformers hide reasoning before producing filler tokens?). Memorization analysis tells the complementary story from the failure side: token-level errors trace to three sources (local, mid-range, and long-range memorization), and *local* memorization based on immediately preceding tokens drives up to 67% of reasoning errors (Where do memorization errors arise in chain-of-thought reasoning?). Both say token positions are functionally specialized: some positions compute, some copy, some hide.
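The logit-lens reading described above can be sketched in a few lines: project each layer's hidden state through the unembedding matrix and check where the answer token ranks. Everything here is synthetic and chosen only to illustrate the mechanics. `VOCAB`, `W_U`, and `HIDDEN_BY_LAYER` are made-up toy values, not measurements from any model, and the final layer norm that a real logit lens usually applies before unembedding is omitted for brevity.

```python
from typing import Dict, List

VOCAB = ["12", "7", "..."]  # "..." plays the role of a filler token

# Toy unembedding matrix (one row per vocabulary entry). Synthetic values.
W_U = [
    [2, 0, 1],  # "12"
    [0, 2, 1],  # "7"
    [1, 1, 3],  # "..."
]

# Toy residual-stream state at the answer position, one vector per layer.
# Synthetic values, picked so the answer leads early and filler wins last.
HIDDEN_BY_LAYER = [
    [3, 1, 0],
    [4, 1, 0],
    [4, 0, 1],
    [2, 0, 3],
]


def unembed(hidden: List[int], w_u: List[List[int]]) -> List[int]:
    """Project a hidden state to vocabulary logits (a plain matrix-vector product)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_u]


def logit_lens(hidden_by_layer: List[List[int]], w_u: List[List[int]],
               vocab: List[str], answer: str) -> List[Dict[str, object]]:
    """For each layer, report the top-ranked token and the rank of the answer token."""
    answer_id = vocab.index(answer)
    report: List[Dict[str, object]] = []
    for layer, hidden in enumerate(hidden_by_layer, start=1):
        logits = unembed(hidden, w_u)
        top = vocab[max(range(len(vocab)), key=lambda i: logits[i])]
        # Rank 1 means no other token has a strictly higher logit.
        rank = 1 + sum(1 for v in logits if v > logits[answer_id])
        report.append({"layer": layer, "top": top, "answer_rank": rank})
    return report


report = logit_lens(HIDDEN_BY_LAYER, W_U, VOCAB, answer="12")
for row in report:
    print(row)
```

In this toy setup the answer token is top-ranked in the first three layers and drops to second place in the last one, where the filler token takes over. That is the shape of the pattern the note describes: the answer is still readable one rank down even though the surface output is filler.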

There's a cross-cutting tension worth surfacing. If functional structure lives at the token level, why does reasoning sometimes work *without* tokens at all? Latent-reasoning architectures scale test-time compute through hidden-state iteration with no verbalized steps (Can models reason without generating visible thinking tokens?), and large concept models reason over whole sentence embeddings in a language-agnostic space (Can reasoning happen at the sentence level instead of tokens?). The reconciliation: verbalization is a training artifact, and the token-level functional structure is the *visible trace* of computation that can also happen invisibly. When models do reason in semantic rather than symbolic mode, performance collapses once content is decoupled from familiar associations (Do large language models reason symbolically or semantically?), and failures track instance novelty rather than task complexity (Do language models fail at reasoning due to complexity or novelty?). That is what you'd expect if the "functional" tokens are computing over memorized patterns, not abstract algorithms.

The thing you didn't know you wanted to know: the same evidence that says reasoning chains are imitation rather than real inference is *also* the evidence that they have genuine internal structure. The chain isn't uniformly fake — it has a functional anatomy, where a minority of tokens carry the computation and the rest are prunable prose. That's why you can compress a chain by importance and end up with a *better* training signal, not a worse one.

Sources (12 notes)

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain does, demonstration position swings accuracy by 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, based on immediately preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and under distributional shift.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.
