Do reasoning models trade instruction following for deliberative capability?
This reads the question as: when a model is tuned to think harder and longer, does it get worse at simply doing what it was told? Worth flagging up front: the corpus contains no head-to-head study pitting instruction-following against reasoning depth, so it answers that exact tradeoff sideways rather than directly. What it does map is the hidden costs of deliberation, which is where the real tension lives. The sideways answer is more interesting than a clean yes/no, because it shows deliberation is not free even on its own terms.
The sharpest evidence that more thinking can actively hurt comes from two directions. First, extended thinking is not inherently good: a vanilla model left to ruminate often talks itself into self-doubt and degrades its own answers, and it takes RL training to redirect that same machinery toward useful gap analysis rather than spiraling [Does extended thinking help or hurt model reasoning?]. Second, step-by-step reasoning is the wrong move for some questions entirely: for simple prompts, a direct question-to-answer path beats forced chain-of-thought, and whether deliberation helps depends on the question's type, not just the task category [Why do some questions perform better without step-by-step reasoning?]. So the cost isn't only to instruction-following: deliberation applied indiscriminately makes models worse at things they'd otherwise nail.
There's a second flavor of cost: deliberation that wanders. Reasoning models explore unsystematically rather than searching methodically, and their success probability drops exponentially as problems get deeper [Why do reasoning LLMs fail at deeper problem solving?]. Depth-only chains fall into an 'underthinking' trap that structured, breadth-first abstractions can rescue [Can abstractions guide exploration better than depth alone?]. And longer inputs alone, at roughly 3,000 tokens of padding and well below the context limit, knock reasoning accuracy from 92% to 68%, even with chain-of-thought prompting [Does reasoning ability actually degrade with longer inputs?]. If you think of instruction-following as 'staying anchored to what was asked,' these are exactly the failure modes that pull a model off-anchor: it gets lost in its own exploration.
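To make the exponential-decay claim concrete, here is a minimal sketch under a loudly stated assumption: every reasoning step succeeds independently with the same probability. That independence model is a simplification introduced here for illustration, not the cited note's analysis, and the function names `success_probability` and `relative_drop` are hypothetical helpers.

```python
def success_probability(per_step: float, depth: int) -> float:
    """Chance of finishing `depth` steps when each step succeeds
    independently with probability `per_step` (an assumed toy model)."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    # Independent steps multiply, so success decays exponentially in depth.
    return per_step ** depth


def relative_drop(before: float, after: float) -> float:
    """Fraction of the starting value that was lost."""
    return (before - after) / before


if __name__ == "__main__":
    # Illustrative per-step reliability of 0.95 (not a measured value).
    for depth in (1, 5, 10, 20, 40):
        print(depth, round(success_probability(0.95, depth), 3))
    # The long-input result quoted above: 92% accuracy falling to 68%.
    print(round(relative_drop(0.92, 0.68), 3))
```

Under this toy model, a step that is right 95% of the time still yields only about a 60% chance of finishing a 10-step problem and about 36% for a 20-step one. By the same arithmetic, the drop from 92% to 68% is 24 percentage points, or roughly a 26% relative loss.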
The deeper reframe is whether the deliberation is even doing what it claims. One line of work argues chain-of-thought is constrained imitation of reasoning *form*, reproducing familiar patterns from training rather than performing genuine inference, which is why it breaks predictably under distribution shift [Does chain-of-thought reasoning reveal genuine inference or pattern matching?]. If that's right, then the 'deliberative capability' a model gains in the trade isn't a stable new skill at all, which changes what the tradeoff even means. Counterweight: other work finds the reasoning was latent in the base model all along, merely *elicited* by post-training rather than created [Do base models already contain hidden reasoning ability?], and that verbalized thinking tokens may be a training artifact rather than a requirement, since models can scale compute in latent space without speaking a single step [Can models reason without generating visible thinking tokens?].
Put together, the corpus reframes your question: the real tension may not be 'instruction-following vs. deliberation' but 'verbosity and unsystematic wandering vs. anchored, well-calibrated thinking.' The encouraging implication is that the cost is largely a tuning problem, not an inherent law. Verbosity is a single steerable direction in activation space, and steering along it can cut chain-of-thought length by 67% while maintaining accuracy [Can we steer reasoning toward brevity without retraining?]; and training, not just more thinking, is what decides whether deliberation helps or hurts [Does extended thinking help or hurt model reasoning?]. The thing you didn't know you wanted to know: a model that thinks more isn't trading away obedience so much as risking getting lost, and 'staying on task' turns out to be a steerable, trainable property rather than a casualty of intelligence.
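The idea of a single steerable direction can be sketched as follows. This is a toy illustration, not the Activation-Steered Compression implementation: it assumes the direction is taken as the difference between mean activations of verbose and concise examples (one common way to derive a steering vector, which may differ from the cited method), and it uses hand-written 2-D lists in place of real model hidden states. All function names and the `alpha` strength parameter are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a batch of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(verbose_acts: Sequence[Sequence[float]],
                       concise_acts: Sequence[Sequence[float]]) -> Vector:
    """Unit vector pointing from the concise mean toward the verbose mean."""
    mv = mean_vector(verbose_acts)
    mc = mean_vector(concise_acts)
    diff = [v - c for v, c in zip(mv, mc)]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("means coincide; no direction to extract")
    return [x / norm for x in diff]


def projection(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """How far `hidden` extends along `direction`."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden: Sequence[float], direction: Sequence[float],
          alpha: float) -> Vector:
    """Shift `hidden` by `alpha` against `direction` (toward concision)."""
    return [h - alpha * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Toy activations: verbosity is assumed to live on the first axis.
    verbose = [[2.0, 1.0], [4.0, 1.0]]
    concise = [[0.0, 1.0], [0.0, 1.0]]
    direction = steering_direction(verbose, concise)
    hidden = [5.0, 2.0]
    steered = steer(hidden, direction, alpha=2.0)
    print(direction, steered)
    print(projection(hidden, direction), projection(steered, direction))
```

The design point the sketch carries is that nothing is retrained: the direction comes from a small set of paired examples, and the intervention is a vector shift at inference time. How the shift interacts with a real model's accuracy is an empirical question the toy example does not address.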
Sources (9 notes)
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.