INQUIRING LINE

Do shorter reasoning chains maintain instruction adherence better than longer ones?

This reads the question as: does keeping a model's reasoning short actually help it stay on track — both in accuracy and in following what was asked — compared to letting it ramble at length?


The corpus offers a surprisingly strong yes to whether shorter chains of thought keep a model better anchored, to the task and to the instruction, than long ones, though it reframes *why*. The cleanest finding is that reasoning quality follows an inverted U against length: accuracy peaks at an intermediate number of steps and then declines, and the optimal length shrinks as the model gets more capable (see 'Why does chain of thought accuracy eventually decline with length?'). Strikingly, reinforcement-learning training pushes models *toward* shorter chains as they improve; brevity isn't imposed, it emerges from the reward signal. So past a point, more reasoning is not more thinking; it's drift.
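To make the inverted-U shape concrete, here is a toy model, not the method of the cited note. It assumes accuracy is the product of two hypothetical factors: a `coverage` term that rises as more steps decompose the task, and a `survival` term that decays as each step adds a small chance of drift. The function names, the functional forms, and the `drift=0.02` default are all illustrative assumptions.

```python
import math


def toy_accuracy(steps: int, difficulty: float, capability: float,
                 drift: float = 0.02) -> float:
    """Toy accuracy for a chain of `steps` steps (illustrative assumption only).

    coverage: chance the chain has broken the task down far enough;
              saturates faster for capable models and easy tasks.
    survival: chance that no step has drifted, assuming independent steps.
    """
    coverage = 1.0 - math.exp(-capability * steps / difficulty)
    survival = (1.0 - drift) ** steps
    return coverage * survival


def best_length(difficulty: float, capability: float,
                drift: float = 0.02, max_steps: int = 200) -> int:
    """Step count in 1..max_steps that maximizes the toy accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability, drift),
    )


if __name__ == "__main__":
    for difficulty, capability in [(10, 1), (10, 2), (20, 1)]:
        n = best_length(difficulty, capability)
        print(f"difficulty={difficulty} capability={capability} best_length={n}")
```

Under these assumptions the peak moves the way the note describes: raising `capability` pulls the best length down, and raising `difficulty` pushes it up. The exact numbers mean nothing outside the toy.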

Why does length hurt adherence specifically? Because each extra step is another place to wander. One decomposition of CoT shows that genuine reasoning does happen but accumulates error with every step, alongside two non-reasoning factors (raw output probability and memorization) that quietly steer the answer (see 'What three separate factors drive chain-of-thought performance?'). A complementary memorization study finds that *local* memorization, the model latching onto its own immediately preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity rises and the input shifts away from the training distribution (see 'Where do memorization errors arise in chain-of-thought reasoning?'). The implication is that a long chain may increasingly take its cues from what it just said rather than from the original instruction. That's the mechanism behind 'losing the thread.'
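The cost of per-step error accumulation can be worked through with simple arithmetic. This minimal sketch assumes each step fails independently with the same probability, which is a simplification and not something the cited studies state. The function names and the 2% error rate are hypothetical.

```python
import math


def chain_reliability(per_step_error: float, steps: int) -> float:
    """Probability that every step is error-free, assuming independent steps."""
    return (1.0 - per_step_error) ** steps


def max_steps_for_target(per_step_error: float, target: float) -> int:
    """Largest step count whose chain reliability stays at or above `target`."""
    if not 0.0 < per_step_error < 1.0:
        raise ValueError("per_step_error must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    # Solve (1 - e)^n >= target for n; both logs are <= 0, so the ratio is >= 0.
    return math.floor(math.log(target) / math.log(1.0 - per_step_error))


if __name__ == "__main__":
    # With a hypothetical 2% error per step, how long can a chain run
    # before the chance of a fully clean chain falls below 90%?
    print(max_steps_for_target(0.02, 0.90))
```

Even a small per-step error rate caps the useful length of a chain quickly, which is one way to read the claim that every extra step is another place to wander.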

The corpus also shows that more text in front of the model, not just more reasoning, degrades fidelity. Reasoning accuracy falls from 92% to 68% with just 3,000 tokens of padding, far below the context window's limit, and chain-of-thought prompting doesn't rescue it (see 'Does reasoning ability actually degrade with longer inputs?'). So verbosity cuts both ways: a bloated chain is itself the kind of long input that erodes the model's grip on the task.

Here's the part that should reassure anyone worried that cutting length costs capability: it largely doesn't. 'Chain of Draft' matches full chain-of-thought accuracy on arithmetic, symbolic, and commonsense tasks using only 7.6% of the tokens; the other 92.4% served style and documentation, not computation (see 'Can minimal reasoning chains match full explanations?'). And brevity turns out to be a steerable *direction* in the model's activations: a single vector extracted from 50 paired examples cuts chain length by 67% while holding accuracy, with no retraining (see 'Can we steer reasoning toward brevity without retraining?'). The implication is that verbose and concise reasoning occupy distinct regions of the model's activation space.
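The idea of brevity as a direction in activation space can be illustrated with a generic difference-of-means steering sketch. This is a common way to build a steering vector and is an assumption here, not a confirmed description of how Activation-Steered Compression computes its vector. The activations are made-up two-dimensional toy numbers, and every function name is hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    return [sum(column) / len(rows) for column in zip(*rows)]


def steering_vector(concise_acts: Sequence[Sequence[float]],
                    verbose_acts: Sequence[Sequence[float]]) -> Vector:
    """Direction pointing from the verbose mean toward the concise mean."""
    concise_mean = mean_vector(concise_acts)
    verbose_mean = mean_vector(verbose_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden: Sequence[float], direction: Sequence[float],
          strength: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction at inference time."""
    return [h + strength * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Toy activations standing in for paired concise/verbose reasoning traces.
    concise = [[1.0, 0.0], [3.0, 0.0]]
    verbose = [[0.0, 2.0], [0.0, 4.0]]
    direction = steering_vector(concise, verbose)
    print(steer([0.0, 0.0], direction, strength=0.5))
```

In a real model the vectors would be hidden states captured at a chosen layer, and `strength` would need tuning so that chains shorten without accuracy falling. Both of those choices are outside what the note specifies.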

Two caveats keep this from being 'shorter is always better.' First, length should track difficulty: optimal chain length rises with task difficulty even as it falls with model capability (see 'Why does chain of thought accuracy eventually decline with length?'), and for simple questions, step-by-step reasoning can actively hurt when the question's information doesn't flow into the prompt before reasoning begins (see 'Why do some questions perform better without step-by-step reasoning?'). Second, *quantity* isn't the real lever; *quality* is. RL training can flip extended thinking from counterproductive self-doubt into productive analysis using the same mechanism (see 'Does extended thinking help or hurt model reasoning?'). The honest synthesis: shorter chains tend to maintain adherence better not because length is inherently bad, but because most of the length in a long chain is doing no work, and every idle token is a fresh opportunity to drift from the instruction.


Sources (8 notes)

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
