What happens to reasoning accuracy when models use more thinking tokens?

This explores what actually happens to accuracy as a reasoning model spends more thinking tokens — and the corpus says the answer is non-monotonic, not 'more is better.'

The headline from the corpus is counterintuitive: accuracy peaks and then falls. One direct measurement found that pushing thinking tokens from ~1,100 up to ~16K dragged benchmark accuracy down from 87.3% to 70.3%, with models overthinking easy problems and underthinking hard ones (see "Does more thinking time always improve reasoning accuracy?"). That isn't a fluke of one benchmark; the same shape shows up as a general pattern. Accuracy as a function of chain-of-thought length traces an inverted U: it rises to an intermediate length, then declines, with the sweet spot shifting longer for harder tasks but *shorter* for more capable models (see "Why does chain of thought accuracy eventually decline with length?"). So 'more thinking' helps only up to a point that depends on both the problem and the model.
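To put the quoted measurement in proportion, here is a small worked calculation. It uses only the figures cited above; the token counts are approximate in the source ("~1,100", "~16K"), so the ratio is a rough estimate, not a reported result.

```python
# Figures quoted in the text; token counts are approximate in the source.
short_tokens, long_tokens = 1_100, 16_000
short_acc, long_acc = 87.3, 70.3  # benchmark accuracy, in percent

token_ratio = long_tokens / short_tokens     # how much more thinking was spent
acc_drop_points = short_acc - long_acc       # absolute drop, in percentage points
relative_drop = acc_drop_points / short_acc  # drop as a share of starting accuracy


def summarize() -> str:
    """Format the three derived quantities on one line."""
    return (f"{token_ratio:.1f}x tokens, "
            f"-{acc_drop_points:.1f} points, "
            f"{relative_drop:.1%} relative drop")


print(summarize())  # 14.5x tokens, -17.0 points, 19.5% relative drop
```

Roughly fourteen times the thinking budget went along with losing about a fifth of the starting accuracy.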

Why does extra thinking backfire? The corpus names two failure modes worth knowing. One is that vanilla models use the thinking budget to talk themselves into self-doubt: extra tokens become second-guessing that degrades the answer, until RL training redirects that same machinery toward productive gap analysis (see "Does extended thinking help or hurt model reasoning?"). The other is 'underthinking': o1-like models abandon promising reasoning paths mid-stream and thrash between ideas, spending tokens on half-explored approaches. Simply penalizing thought-switching tokens during decoding, with no retraining, improves accuracy on hard math (see "Do reasoning models switch between ideas too frequently?"). Quantity, in other words, is the wrong knob; what those tokens *do* is the thing.
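The decoding-time penalty on thought-switching can be sketched in a few lines. This is a minimal illustration of the idea, not the published TIP implementation: the marker tokens in `SWITCH_TOKENS`, the `penalty` size, and the `window` length are all hypothetical values chosen for the example.

```python
import math

# Hypothetical markers that open a new line of thought; the real set is an assumption.
SWITCH_TOKENS = {"Alternatively", "Instead"}


def penalize_switches(logits, steps_in_thought, penalty=3.0, window=20):
    """Lower switch-token logits while the current thought is still young.

    `logits` maps candidate tokens to raw scores. `penalty` and `window`
    are illustrative defaults, not values taken from the source.
    """
    if steps_in_thought >= window:
        return dict(logits)  # the thought has run long enough; switching is free
    return {tok: (score - penalty if tok in SWITCH_TOKENS else score)
            for tok, score in logits.items()}


def softmax(logits):
    """Turn a token -> score mapping into a probability distribution."""
    top = max(logits.values())
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
before = softmax(logits)
after = softmax(penalize_switches(logits, steps_in_thought=5))
print(round(before["Alternatively"], 3), round(after["Alternatively"], 3))
```

Because the adjustment touches only the logits at sampling time, the model weights stay untouched, which matches the "no retraining" framing above.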

That points to a deeper finding: not all thinking tokens carry weight, and most don't. Only about 20% of tokens are high-entropy 'forking points' where the reasoning genuinely branches, and training on just those tokens matches or beats full-gradient updates (see "Do high-entropy tokens drive reasoning model improvements?"). A handful of tokens like 'Wait' and 'Therefore' spike in mutual information with the correct answer: suppress them and accuracy drops, while suppressing the same number of random tokens does nothing (see "Do reflection tokens carry more information about correct answers?"). Likelihood-preserving pruning shows that models even rank tokens by function, preferentially preserving symbolic computation and shedding grammar and meta-chatter first (see "Which tokens in reasoning chains actually matter most?"). So a longer trace is mostly padding around a few load-bearing decisions, which is why piling on tokens dilutes rather than helps.
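One way to make the 'forking point' idea concrete is to rank positions by the entropy of the model's next-token distribution and keep the top fifth. The sketch below assumes you already have per-position probability distributions; the function names and the toy distributions are made up for illustration, and the 0.2 cutoff simply mirrors the ~20% figure cited above.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(step_distributions, top_fraction=0.2):
    """Return the indices of the highest-entropy positions, in order."""
    entropies = [entropy(p) for p in step_distributions]
    k = max(1, round(len(entropies) * top_fraction))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy distributions for five positions; only position 1 is genuinely uncertain.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
forks = forking_positions(dists)
mask = [i in forks for i in range(len(dists))]  # True where a loss would be kept
print(forks, mask)
```

A mask like this is the kind of selector that could restrict a training loss to forking tokens, though how the cited work applies it in practice is not specified here.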

Here's the part that should unsettle the intuition that the trace *is* the reasoning: it largely isn't. Models trained on deliberately corrupted, irrelevant traces stay just as accurate, and sometimes generalize *better* out of distribution, suggesting traces are computational scaffolding, not meaningful steps (see "Do reasoning traces need to be semantically correct?"). Invalid traces routinely produce correct answers, which means the visible tokens correlate with the answer through learned formatting, not causal execution (see "Do reasoning traces actually cause correct answers?"). Logit-lens analysis shows that models trained with hidden chain-of-thought tokens compute the answer in their earliest layers and then suppress it in the final layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"), and other architectures scale test-time compute entirely in latent space with no verbalized tokens at all (see "Can models reason without generating visible thinking tokens?"). If reasoning can happen without visible tokens, then visible-token count was never the real lever.

The practical upshot: you can keep accuracy while slashing length. A single steering vector extracted from 50 paired examples cuts chain-of-thought length by 67% with a 2.73x speedup and no accuracy loss, all without training (see "Can we steer reasoning toward brevity without retraining?"). Read together, the collection rewrites the question: more thinking tokens don't buy more accuracy past a task-and-model-dependent peak, because accuracy lives in a small set of pivotal tokens, or even in computation that never surfaces as tokens at all. The lever to reach for is *which* tokens and *whether the reasoning is well-directed*, not how many.
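A steering vector built from paired examples is often computed as a mean difference of hidden activations, then added to the hidden state at inference time. The sketch below assumes that construction; whether Activation-Steered Compression uses exactly this recipe is not stated here, and the helper names, the scaling factor `alpha`, and the tiny three-dimensional activations are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(concise_acts, verbose_acts):
    """Mean activation difference (concise minus verbose) over paired examples."""
    diffs = [[c - v for c, v in zip(concise, verbose)]
             for concise, verbose in zip(concise_acts, verbose_acts)]
    return mean_vector(diffs)


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction; alpha is illustrative."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Two toy pairs of activations for the same prompts (concise vs. verbose traces).
concise = [[1.0, 0.0, 2.0], [3.0, 1.0, 2.0]]
verbose = [[0.0, 0.0, 1.0], [1.0, 1.0, 2.0]]

v = steering_vector(concise, verbose)
steered = steer([0.0, 0.0, 0.0], v, alpha=2.0)
print(v, steered)  # [1.5, 0.0, 0.5] [3.0, 0.0, 1.0]
```

In a real model the vectors would come from a chosen layer's residual stream and the shift would be applied through a forward hook; both of those details are left out of this sketch.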

Sources (12 notes)

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
