Why do concise reasoning chains match verbose chain-of-thought token efficiency?

This explores why a model can cut its reasoning chain to a fraction of the tokens and still land the same answers: what the removed tokens were actually doing, and why they weren't load-bearing.

The short version is that most of the tokens in a long reasoning trace aren't doing the reasoning. When Chain of Draft achieves equivalent accuracy at just 7.6% of the token cost, the implication is blunt: the other 92.4% of tokens served style, fluency, and documentation, not computation (Can minimal reasoning chains match full explanations?). The verbose chain reads like thinking, but the part that changes the outcome is a thin slice inside it.

Several notes in the corpus independently locate that thin slice. When you greedily prune a reasoning chain while preserving the model's own likelihood, tokens sort into functional categories: symbolic computation is preferentially preserved, while grammar and meta-discourse are dropped first (Which tokens in reasoning chains actually matter most?). From the training side, only about 20% of tokens are high-entropy 'forking points' where the model actually decides something, and training on just those matches full updates (Do high-entropy tokens drive reasoning model improvements?). And at test time, attention maps show that verification and backtracking steps receive almost no downstream attention, so cutting 75% of steps leaves accuracy intact (Can reasoning steps be dynamically pruned without losing accuracy?). Three different lenses (likelihood, entropy, attention) keep pointing at the same small minority of tokens carrying the work.
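To make the entropy lens concrete, here is a minimal sketch of how one might flag 'forking' positions from per-token probability distributions. The function names, the `keep_fraction` parameter, and the toy distributions are all illustrative assumptions, not the method of the cited work; the 0.2 default simply mirrors the roughly 20% figure above.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_token_mask(distributions, keep_fraction=0.2):
    """Mark the highest-entropy positions as 'forking' tokens.

    distributions: one probability list per generated position.
    keep_fraction: hypothetical knob; 0.2 mirrors the ~20% figure in the text.
    """
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(len(entropies) * keep_fraction))
    # Indices of the n_keep positions with the largest entropy.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]

# Toy distributions over a 4-token vocabulary (made-up numbers).
toy = [
    [0.97, 0.01, 0.01, 0.01],      # near-certain continuation
    [0.25, 0.25, 0.25, 0.25],      # genuine decision point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
mask = forking_token_mask(toy)
print(mask)
```

In a training setup, a mask like this could be used to restrict the loss or gradient to the flagged positions; that step is left out here because it depends on the framework in use.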

There's a reason verbose chains aren't just neutral padding but can actively hurt: accuracy follows an inverted-U in CoT length, peaking at intermediate length and declining when chains run long, with more capable models preferring shorter chains (Why does chain of thought accuracy eventually decline with length?). That dovetails with a separate finding that reasoning accuracy degrades sharply as inputs get longer, well below the context limit: a 92%-to-68% drop from mere padding (Does reasoning ability actually degrade with longer inputs?). So a long chain isn't free; it's extra length the model then has to reason over. Conciseness isn't a tax you pay for efficiency; past a point it's where accuracy lives.

The deeper 'why' is what CoT is in the first place. If chain-of-thought were genuine step-by-step symbolic inference, every step would be necessary. But the corpus argues it's constrained imitation of reasoning *form*: pattern-guided generation where training format shapes reasoning strategy 7.5× more than domain and invalid CoT prompts work as well as valid ones (What makes chain-of-thought reasoning actually work?; Does chain-of-thought reasoning reveal genuine inference or pattern matching?). If much of the verbal scaffolding is performing the shape of reasoning rather than computing, it follows that you can delete the performance and keep the computation. The most striking evidence: on easy problems, models often commit to an answer internally before they finish writing the chain, which lets probe-guided early exit cut up to 80% of tokens with no accuracy loss. On genuinely hard problems, though, the reasoning does track real belief updates (Does chain-of-thought reasoning reflect genuine thinking or performance?).
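The early-exit idea can be sketched as a simple stopping rule over a probe's per-step confidence. This is a toy illustration under stated assumptions: the probe itself is not implemented, and `threshold`, `patience`, and the confidence traces below are hypothetical values chosen to show the easy-versus-hard contrast, not figures from the cited note.

```python
def early_exit_step(probe_confidences, threshold=0.95, patience=2):
    """Return the index of the reasoning step at which generation could stop.

    probe_confidences: non-empty list with the (hypothetical) probe's confidence
        in its current answer guess after each reasoning step.
    Exit once confidence stays >= threshold for `patience` consecutive steps;
    if that never happens, run to the last step.
    """
    if not probe_confidences:
        raise ValueError("need at least one reasoning step")
    streak = 0
    for step, conf in enumerate(probe_confidences):
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return step
    return len(probe_confidences) - 1

def fraction_saved(probe_confidences, **kwargs):
    """Share of reasoning steps skipped by exiting early."""
    used = early_exit_step(probe_confidences, **kwargs) + 1
    return 1 - used / len(probe_confidences)

# Made-up confidence traces: the easy one settles early, the hard one keeps moving.
easy = [0.60, 0.96, 0.97, 0.98, 0.98, 0.99, 0.99, 0.99, 0.99, 0.99]
hard = [0.30, 0.40, 0.35, 0.70, 0.55, 0.80, 0.90, 0.93, 0.96, 0.97]

print(early_exit_step(easy), fraction_saved(easy))
print(early_exit_step(hard), fraction_saved(hard))
```

The `patience` requirement is a design choice to avoid exiting on a single confident but unstable step; on the hard trace the rule never fires early, which matches the observation that hard-problem reasoning is still doing work.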

The unexpected takeaway is that the verbalization may not be required at all. Latent-reasoning architectures scale test-time compute by iterating in hidden state, with no verbalized intermediate tokens, suggesting that writing the chain out is a training artifact, not a reasoning requirement (Can models reason without generating visible thinking tokens?). Concise chains match verbose ones for the same reason silent reasoning can match spoken reasoning: the words were always a readable shadow of a much smaller computation.
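As a rough picture of what "iterating in hidden state" means, the toy below applies one update function repeatedly to a hidden vector and emits nothing in between. It is only a schematic under loud assumptions: `recurrent_block`, its `weight`, and the tanh update are invented for illustration and do not reproduce depth-recurrent models, Heima, or Coconut.

```python
import math

def recurrent_block(hidden, inputs, weight=0.5):
    """One hypothetical latent update: mix the state with the input, then squash."""
    return [math.tanh(weight * h + x) for h, x in zip(hidden, inputs)]

def latent_reason(inputs, num_iterations):
    """Spend compute by iterating the block; no intermediate tokens are produced."""
    hidden = [0.0] * len(inputs)
    for _ in range(num_iterations):
        hidden = recurrent_block(hidden, inputs)
    return hidden

inputs = [0.1, -0.2]  # stand-in for an encoded prompt (made-up values)
for steps in (1, 4, 16):
    # More iterations means more test-time compute with zero extra output tokens.
    print(steps, latent_reason(inputs, steps))
```

The only knob that scales compute here is `num_iterations`, which is the point of the analogy: the amount of "thinking" is decoupled from the number of tokens written.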

Sources (10 notes)

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning reflect genuine thinking or performance?

Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80% without accuracy loss.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
