How does data quality mismatch create reasoning degradation in supervised fine-tuning?
This explores a counterintuitive finding in the corpus: fine-tuning can make a model's answers look better while making its actual reasoning worse. It also asks what role the *content* and *difficulty* of training data play in that gap.
The sharpest version of the problem is what you might call the accuracy trap. Supervised fine-tuning reliably raises benchmark scores while quietly cutting the quality of the reasoning steps that get there, by roughly 39% on an "information gain" measure (Does supervised fine-tuning improve reasoning or just answers?; Does supervised fine-tuning actually improve reasoning quality?). The model learns to reach correct answers through pattern-matching shortcuts and post-hoc rationalization rather than genuine inference. Standard metrics miss this entirely because they only check whether the final answer is right. A companion finding shows the reasoning chain becoming causally disconnected from the answer: you can truncate it, paraphrase it, or swap in filler, and after fine-tuning the model more often spits out the same answer anyway. The reasoning has become performance, not function (Does fine-tuning disconnect reasoning steps from final answers?).
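The perturbation tests described above can be sketched as a small invariance check. This is a minimal illustration, not the procedure used in the cited notes: `answer_fn` is a hypothetical stand-in for a model call, the two toy answer functions are invented for the demo, and the paraphrase test is left out because it would need a second model to rewrite the steps.

```python
from typing import Callable, Dict, List

def truncate(steps: List[str], keep: int) -> List[str]:
    """Early termination: keep only the first `keep` reasoning steps."""
    return steps[:keep]

def filler(steps: List[str], token: str = "...") -> List[str]:
    """Filler substitution: replace every step with a content-free token."""
    return [token for _ in steps]

def invariance_report(
    question: str,
    steps: List[str],
    answer_fn: Callable[[str, List[str]], str],
) -> Dict[str, bool]:
    """For each perturbation, report whether the final answer stayed the same.

    True means the answer ignored the change to the reasoning chain,
    which is the symptom of performative reasoning.
    """
    original = answer_fn(question, steps)
    perturbed = {
        "truncated": truncate(steps, max(1, len(steps) // 2)),
        "filler": filler(steps),
    }
    return {name: answer_fn(question, p) == original for name, p in perturbed.items()}

# Hypothetical toy models, for illustration only.
def ignores_reasoning(question: str, steps: List[str]) -> str:
    return "42"  # the answer never depends on the steps

def uses_reasoning(question: str, steps: List[str]) -> str:
    return steps[-1] if steps else ""  # the answer is read off the last step

steps = ["6 * 7 = 42", "so the result is 42", "42"]
unfaithful = invariance_report("What is 6 * 7?", steps, ignores_reasoning)
faithful = invariance_report("What is 6 * 7?", steps, uses_reasoning)
```

In this toy setup, `unfaithful` reports the answer as invariant under both perturbations, while `faithful` reports a change under both. With a real model, the quantity of interest would be how often answers stay invariant before versus after fine-tuning.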
Here's where "data quality mismatch" gets surprising. Several notes suggest the model often isn't learning the *content* of your training data at all; it's learning the *shape* of the output. Models trained on semantically empty or even deliberately wrong instructions perform about as well as those trained on correct ones (43% vs. a 42.6% random baseline); what transfers is knowledge of the output space, not task understanding (Does instruction tuning teach task understanding or output format?). The same holds for reasoning traces themselves: systematically corrupted, irrelevant traces teach roughly as well as correct ones, implying the traces act as computational scaffolding rather than meaningful steps (Do reasoning traces need to be semantically correct?). On optimization problems, SFT makes outputs *look* correct (valid JSON, right sections) without making them physically feasible (Does supervised fine-tuning actually improve reasoning on optimization problems?). So the "degradation" isn't that bad data poisons good reasoning; it's that SFT was teaching surface form all along, and once you measure reasoning directly the illusion breaks.
That reframes what a *mismatch* even is. If you actually want criteria to transfer (say, judging argument quality), labeled examples alone fail, because the model learns surface patterns instead of principles; you need explicit theoretical frameworks baked into instruction (Can models learn argument quality from labeled examples alone?). Difficulty mismatch bites too: training on problems that are too hard for the model rewards rare accidental successes as if they were skill, amplifying degenerate shortcuts that then contaminate capabilities the model already had (Do overly hard RLVR samples actually harm model capabilities?). And a related RL result shows fine-tuning sharpening memorization rather than installing procedures: performance collapses on out-of-distribution variants of the same problem (Do fine-tuned language models actually learn optimization procedures?).
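The difficulty-mismatch point can be made concrete with a little arithmetic. The sketch below assumes the common group-relative scheme in which each sampled response's reward is normalized by the mean and standard deviation of its group; the group size of 16, the binary rewards, the use of the population standard deviation, and the `eps` term are illustrative assumptions, not details taken from the cited note.

```python
import math
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and (population) std of its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# A problem that is too hard: 1 accidental success out of 16 samples.
too_hard = [1.0] + [0.0] * 15
# A well-matched problem: 8 successes out of 16 samples.
well_matched = [1.0] * 8 + [0.0] * 8

adv_hard = group_relative_advantages(too_hard)[0]         # roughly sqrt(15), about 3.87
adv_matched = group_relative_advantages(well_matched)[0]  # roughly 1.0

# When every sample fails, all advantages are zero and nothing is learned.
adv_all_fail = group_relative_advantages([0.0] * 16)
```

Under these assumptions, a single lucky success in a group of G samples receives an advantage of about sqrt(G - 1), so the rarer the success, the harder that one trajectory is reinforced, whether or not it reflects sound reasoning.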
The quietly hopeful counter-thread is that the damage is largely about *where* fine-tuning writes. Direct fine-tuning corrupts knowledge stored in lower layers, but decoding-time proxy-tuning closes 88-91% of the alignment gap while leaving base weights untouched, shifting style and reasoning without overwriting stored knowledge (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). And LIMA's finding that 1,000 carefully curated examples rival datasets orders of magnitude larger points the same direction: post-training mostly *activates* capabilities the pretrained model already has rather than building new ones (Can careful curation replace massive alignment datasets?). The thing you didn't know you wanted to know: reasoning degradation from SFT may be less about feeding the model wrong answers and more about how little of your data's meaning it was ever absorbing, which is why curation and where-you-tune matter more than sheer volume.
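To make "leaving base weights untouched" concrete, here is a minimal sketch of the decoding-time idea as we understand it: the large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart, and no weights are written. The function names and the three-token logit vectors are hypothetical; a real setup would take these logits from three models that share a vocabulary.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(
    base: List[float],
    expert: List[float],
    anti_expert: List[float],
) -> List[float]:
    """Shift base logits by (small tuned - small untuned), then normalize."""
    shifted = [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]
    return softmax(shifted)

# Illustrative logits over a 3-token vocabulary (made-up numbers).
base_logits = [2.0, 1.0, 0.0]    # large pretrained model
small_tuned = [0.0, 1.5, 0.0]    # small fine-tuned "expert"
small_untuned = [0.0, 0.0, 0.0]  # the same small model before tuning

probs = proxy_tuned_probs(base_logits, small_tuned, small_untuned)
# With identical expert and anti-expert, the base distribution is recovered.
no_shift = proxy_tuned_probs(base_logits, small_untuned, small_untuned)
```

In this example the shift moves the most likely token from index 0 to index 1, while `base_logits` itself is never modified, which mirrors the claim that the base model's stored knowledge is not overwritten.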
Sources (11 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9%, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full correct instructions (43% vs. a 42.6% random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.