Does fine-tuning push models toward reasoning shortcuts that bypass the chain entirely?

This explores whether fine-tuning teaches models to produce correct answers while treating the reasoning chain as decorative — arriving at the answer some other way and writing the steps afterward.

The corpus says yes, fairly directly, and the most striking finding is that you can't see it in accuracy scores. One set of faithfulness tests shows that after fine-tuning, you can cut a model's reasoning short, paraphrase it, or even swap in filler text, and the final answer stays the same more often than before (see "Does fine-tuning disconnect reasoning steps from final answers?"). If garbling the chain doesn't change the answer, the chain wasn't doing the work. A companion result puts a number on it: supervised fine-tuning raised benchmark accuracy while the inferential quality of the steps, measured as Information Gain, dropped 38.9%, because the model learned to rationalize a known answer rather than reason toward an unknown one (see "Does supervised fine-tuning improve reasoning or just answers?"). Standard metrics miss this entirely because they only check the last line.
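The perturbation tests described above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not the protocol used in the cited note: `answer_fn`, `truncate`, `filler`, and both toy models are hypothetical names introduced here, and paraphrasing is left out because it needs a second model.

```python
from typing import Callable, List, Tuple

# Assumed interface: a model is any function (question, chain) -> final answer.
AnswerFn = Callable[[str, str], str]
Perturbation = Callable[[str], str]


def truncate(chain: str, keep_fraction: float = 0.5) -> str:
    """Early termination: keep only the first portion of the chain's steps."""
    steps = chain.split("\n")
    keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:keep])


def filler(chain: str, token: str = "...") -> str:
    """Filler substitution: replace every word in the chain with a placeholder."""
    return " ".join(token for _ in chain.split())


def invariance_rate(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, str]],
    perturb: Perturbation,
) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain.

    A high rate means the chain is not influencing the answer.
    """
    if not examples:
        raise ValueError("need at least one example")
    unchanged = sum(
        answer_fn(question, chain) == answer_fn(question, perturb(chain))
        for question, chain in examples
    )
    return unchanged / len(examples)


# Two toy stand-ins for a model (hypothetical, for illustration only).
def chain_ignoring_model(question: str, chain: str) -> str:
    return str(len(question))  # the answer depends on the question alone


def chain_reading_model(question: str, chain: str) -> str:
    return chain.split("\n")[-1].strip()  # the answer is read off the last step


examples = [
    ("2+3?", "start with 2\nadd 3\n5"),
    ("4*2?", "start with 4\ndouble it\n8"),
]

print(invariance_rate(chain_ignoring_model, examples, truncate))  # 1.0
print(invariance_rate(chain_reading_model, examples, truncate))   # 0.0
```

Comparing this rate before and after fine-tuning, on the same examples and perturbations, is the kind of measurement the faithfulness result relies on. Accuracy alone would not distinguish the two toy models if both happened to answer correctly.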

The more unsettling possibility is that the chain may never have been load-bearing to begin with, and fine-tuning just sharpens a shortcut that was always there. Chain-of-thought, on this reading, is constrained imitation: the model reproduces familiar reasoning *shapes* from training rather than performing fresh inference, which is why performance falls apart under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). Reasoning traces work as persuasive appearances; logically invalid steps perform nearly as well as valid ones, so semantic correctness isn't what's generating the score (see "Do reasoning traces show how models actually think?"). Fine-tuning doesn't necessarily *create* the bypass. It rewards whatever produces the right final token, and pattern-matching is the cheapest path to that reward.

Reinforcement-style fine-tuning isn't exempt. Even GRPO-trained models crater on out-of-distribution variants of problems they handle in-distribution, which suggests RL is tightening template-matching rather than installing a transferable procedure (see "Do fine-tuned language models actually learn optimization procedures?"). The same texture shows up from another angle: models break not at a complexity threshold but at an unfamiliarity boundary, so a chain of any length succeeds if the instance resembles training data (see "Do language models fail at reasoning due to complexity or novelty?"). That's the signature of a lookup dressed as a derivation.

What makes this worth knowing is the inversion it implies: the interventions that actually preserve reasoning tend to leave the main model's weights alone. Penalizing premature thought-switching at decode time improves accuracy with no fine-tuning (see "Do reasoning models switch between ideas too frequently?"). Steering verbosity is a training-free activation-space edit (see "Can we steer reasoning toward brevity without retraining?"). And SoftCoT deliberately *freezes* the backbone, delegating the thinking to a small auxiliary module, specifically to keep fine-tuning from eroding the capability (see "Can continuous reasoning avoid forgetting in instruction-tuned models?"). Read together, the collection hints that the chain is most genuine when training touches it least, and that much of what we call "fine-tuning for reasoning" may be quietly teaching the model to skip it.
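To make "decode time, no fine-tuning" concrete, here is a rough sketch of a thought-switching penalty applied to next-token scores. It is an assumption-laden illustration and not the TIP implementation: the token list, the `penalty` and `window` parameters, and the rule of penalizing only while the current thought is young are all hypothetical choices made for this example.

```python
from typing import Dict, Set


def penalize_switch_tokens(
    logits: Dict[str, float],
    switch_tokens: Set[str],
    steps_in_thought: int,
    penalty: float = 3.0,   # assumed strength, not a published value
    window: int = 50,       # assumed number of steps the penalty stays active
) -> Dict[str, float]:
    """Return a copy of the logits with switch tokens pushed down.

    The penalty applies only while the current thought is shorter than
    `window` steps, so the model can still change approach later on.
    """
    if steps_in_thought >= window:
        return dict(logits)
    return {
        token: score - penalty if token in switch_tokens else score
        for token, score in logits.items()
    }


def greedy_pick(logits: Dict[str, float]) -> str:
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)


# Toy next-token scores; the vocabulary is made up for the example.
raw = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switches = {"Alternatively"}

print(greedy_pick(raw))                                           # Alternatively
print(greedy_pick(penalize_switch_tokens(raw, switches, 5)))      # so
print(greedy_pick(penalize_switch_tokens(raw, switches, 60)))     # Alternatively
```

Nothing here touches model weights. The intervention lives entirely in the sampling loop, which is why it cannot erode whatever the model already does with its chain.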

Sources (9 notes)

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
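One common way to obtain a single steering vector from paired examples is a mean difference of activations. The sketch below assumes that approach for illustration; the note does not say how Activation-Steered Compression computes its vector, and `mean_difference_vector`, `steer`, and the `strength` parameter are hypothetical names. Activations are plain lists of floats to keep the example dependency-free.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference_vector(
    concise: Sequence[Vector], verbose: Sequence[Vector]
) -> Vector:
    """Average of (concise - verbose) activations over paired examples."""
    if not concise or len(concise) != len(verbose):
        raise ValueError("need the same non-zero number of paired activations")
    n = len(concise)
    dim = len(concise[0])
    return [
        sum(c[i] - v[i] for c, v in zip(concise, verbose)) / n
        for i in range(dim)
    ]


def steer(hidden: Vector, direction: Vector, strength: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction at inference time."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Two toy activation pairs in a 2-dimensional space (made-up numbers).
concise_acts = [[1.0, 0.0], [3.0, 2.0]]
verbose_acts = [[0.0, 0.0], [1.0, 2.0]]

direction = mean_difference_vector(concise_acts, verbose_acts)
print(direction)                          # [1.5, 0.0]
print(steer([0.0, 1.0], direction, 2.0))  # [3.0, 1.0]
```

In a real model the shift would be applied to a chosen layer's hidden states during generation, with the weights left untouched.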

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.
