INQUIRING LINE

Why does concise reasoning maintain accuracy with far fewer tokens?

This explores why stripping reasoning down to a fraction of its usual length doesn't hurt accuracy — and what that tells us about which parts of a reasoning trace are actually doing the work.


The corpus has a surprisingly consistent answer: most of what a model writes when it 'thinks out loud' is documentation, not computation. Chain of Draft makes this almost literal, matching standard chain-of-thought accuracy on arithmetic, symbolic, and commonsense tasks while using just 7.6% of the tokens, which means the other 92.4% served style and explanation rather than the actual solving (see: Can minimal reasoning chains match full explanations?). When researchers prune reasoning chains token by token, the same picture emerges: symbolic computation tokens are preserved while grammar and meta-commentary are dropped first (see: Which tokens in reasoning chains actually matter most?). A separate line of work finds a small minority of high-entropy 'forking point' tokens, about 20%, where the model actually makes its pivotal decisions (see: Do high-entropy tokens drive reasoning model improvements?). Concise reasoning works because it keeps the load-bearing minority and discards the filler.
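The 'forking point' idea can be made concrete with a small sketch. It is an illustration under stated assumptions, not the method of any cited note: the function names, the toy distributions, and the use of a fixed top-fraction cutoff are all hypothetical. It scores each position by the Shannon entropy of its next-token distribution and keeps the highest-entropy fraction.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_positions(distributions, fraction=0.2):
    """Return the indices of the highest-entropy positions, in order.

    `distributions` holds one next-token probability list per position.
    `fraction` is an assumed cutoff that mirrors the roughly 20% figure.
    """
    scores = [entropy(d) for d in distributions]
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy trace (made-up numbers): five positions, one genuinely uncertain.
toy = [
    [0.97, 0.01, 0.01, 0.01],      # near-deterministic filler
    [0.25, 0.25, 0.25, 0.25],      # maximal uncertainty: a candidate fork
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
print(forking_positions(toy))  # [1]
```

With a real model the distributions would come from the softmax over the vocabulary at each generated position. The toy lists here only stand in for that.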

There's a second, sharper reason: longer is often actively worse, not just wasteful. In o1-style models, correct traces are *shorter* than incorrect ones, because extra length tends to come from self-revisions that introduce and compound errors rather than fix them (see: Why do correct reasoning traces contain fewer tokens?). Push the thinking budget hard and accuracy can drop sharply: one study found that going from ~1,100 to ~16K thinking tokens cut benchmark accuracy from 87.3% to 70.3%, as models overthink easy problems (see: Does more thinking time always improve reasoning accuracy?). Accuracy as a function of chain length follows an inverted U, and, tellingly, RL training naturally drifts toward shorter chains as models get better, so brevity emerges from the reward signal rather than being something you have to force (see: Why does chain of thought accuracy eventually decline with length?). Part of why more tokens can hurt is that reasoning quality degrades as the input grows: on FLenQA, accuracy falls from 92% to 68% with just 3,000 tokens of padding, far below the context limit (see: Does reasoning ability actually degrade with longer inputs?). A bloated trace becomes part of its own distracting input.

The most provocative thread asks whether the verbal trace is necessary at all. You can extract a single 'verbosity direction' in activation space and steer a model to cut chain-of-thought length by 67% with no retraining and no accuracy loss, which implies that verbose and concise reasoning occupy distinct regions of the model's internal space (see: Can we steer reasoning toward brevity without retraining?). Go further and models can scale test-time compute in latent space, iterating on hidden states without verbalizing any intermediate steps, which suggests the words are a training artifact rather than the reasoning itself (see: Can models reason without generating visible thinking tokens?). Logit-lens analysis of models trained with hidden chain-of-thought tokens shows them computing the correct answer in the first few layers, then suppressing it in the final layers to emit format-compliant filler: the visible tokens are a costume, not the thinking (see: Do transformers hide reasoning before producing filler tokens?).
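To make the 'verbosity direction' idea tangible, here is a minimal sketch. It assumes the direction is the difference between mean activations on verbose and concise examples, which is a common steering recipe and not necessarily how Activation-Steered Compression computes its vector. The activations are synthetic, and every name (`verbosity_direction`, `steer`, `projection`) is hypothetical.

```python
import random

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def verbosity_direction(verbose_acts, concise_acts):
    """Assumed recipe: mean verbose activation minus mean concise activation."""
    mean_verbose = mean_vector(verbose_acts)
    mean_concise = mean_vector(concise_acts)
    return [v - c for v, c in zip(mean_verbose, mean_concise)]

def steer(hidden, direction, strength=1.0):
    """Shift one hidden state against the verbosity direction."""
    return [h - strength * d for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """Dot product of a hidden state with the direction."""
    return sum(h * d for h, d in zip(hidden, direction))

# Synthetic stand-ins for activations; no real model is involved.
random.seed(0)
DIM, PAIRS = 8, 50
concise = [[random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(PAIRS)]
# Verbose examples are the concise ones shifted along the first axis.
verbose = [[c[0] + 2.0] + c[1:] for c in concise]

direction = verbosity_direction(verbose, concise)
before = projection(verbose[0], direction)
after = projection(steer(verbose[0], direction), direction)
print(f"projection before steering: {before:.3f}, after: {after:.3f}")
```

In a real setting the vectors would be residual-stream activations captured at a chosen layer, and `strength` would be tuned so that length drops without hurting accuracy. Both choices are left open here.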

If you want the structural version of this idea, two notes reframe reasoning to shed its own baggage. Atom of Thoughts makes each step depend only on the current subproblem, a 'memoryless' contraction that drops the accumulated history that bloats long chains while keeping the answer intact (see: Can reasoning systems forget history without losing coherence?). Large Concept Models reason over whole sentence embeddings instead of individual tokens, planning at a higher level of abstraction entirely (see: Can reasoning happen at the sentence level instead of tokens?). The takeaway: concise reasoning doesn't trade accuracy for speed. The verbosity was never where the accuracy lived.


Sources: 12 notes

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Why do correct reasoning traces contain fewer tokens?

Across QwQ, DeepSeek-R1, and LIMO, correct solutions average fewer tokens than incorrect ones. Longer traces correlate with more self-revisions, which introduce and compound errors rather than improve reasoning quality.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
