Does supervised fine-tuning actually improve reasoning quality?
While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
The Knowledge or Reasoning paper introduces a framework that separates what most benchmarks conflate: factual knowledge accuracy and reasoning quality. It defines two metrics: Knowledge Index (KI), which asks whether each reasoning step invokes factually correct domain knowledge, and Information Gain (InfoGain), which asks whether each reasoning step actually reduces uncertainty about the final answer. These are different properties, and models can do well on one while failing the other.
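The two metrics can be operationalized per reasoning step. A minimal sketch, assuming InfoGain is measured as the increase in the answer's log-probability as each step is appended (the paper's exact estimator may differ) and KI as the fraction of steps an external fact-checking judge marks correct; the probability values here are toy stand-ins for real model scores:

```python
import math

def info_gain(step_probs):
    """Per-step InfoGain: reduction in uncertainty (negative log-probability)
    about the final answer contributed by each appended reasoning step.
    step_probs[k] = P(correct answer | question + first k steps);
    step_probs[0] is the prior before any reasoning is shown."""
    return [
        math.log(step_probs[k + 1]) - math.log(step_probs[k])
        for k in range(len(step_probs) - 1)
    ]

def knowledge_index(step_correct):
    """KI: fraction of reasoning steps that invoke factually correct
    domain knowledge (booleans from an external fact-checking judge)."""
    return sum(step_correct) / len(step_correct)

# Toy trace: every step raises confidence in the right answer, so each
# per-step gain is positive; one step still cites incorrect knowledge.
probs = [0.25, 0.40, 0.70, 0.90]           # toy stand-ins for model scores
gains = info_gain(probs)                   # positive entries = informative steps
ki = knowledge_index([True, True, False])  # 2 of 3 steps factually sound
```

Note the dissociation this makes visible: a trace can score high on KI (every step factually correct) while its InfoGain is near zero (the steps restate facts without reducing uncertainty), which is exactly what the framework is built to detect.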
The finding on SFT is stark. Fine-tuning on domain data raises final-answer accuracy in both mathematical and medical domains, but InfoGain drops by 38.9% on average compared to the un-fine-tuned base models. The models reach correct answers via less informative reasoning paths: shorter, more direct routes that skip the inferential steps that actually justify the conclusion.
This is not a paradox once you understand the training signal. SFT rewards answers. The training data contains question-answer pairs, possibly with reasoning chains, but the loss function is ultimately anchored to producing the correct final output. Models learn the most efficient path to the answer in the training distribution — which often means picking up domain-specific shortcuts and pattern matches rather than reasoning through the problem. The "reasoning" in the output becomes a post-hoc rationalization of the answer, not the path by which the answer was reached.
The deployment implication is significant for high-stakes domains. Medical decision support, legal AI, and financial analysis all require not just correct answers but auditable reasoning — the ability to show why the conclusion follows from the evidence. If SFT degrades InfoGain, systems trained this way are less auditable even when they are more accurate. The accuracy improvement is real; it also masks the reasoning regression.
The "SFT Memorizes, RL Generalizes" paper provides cross-domain confirmation in the visual reasoning domain. SFT models display classic memorization symptoms — high sensitivity to input distribution with sharp accuracy drops on out-of-distribution examples — while RL-trained models maintain stable reasoning performance across distribution shifts. The SFT-trained models develop stronger memorization of specific visual patterns rather than learning generalizable reasoning strategies. This extends the KI/InfoGain finding beyond medical and math: the SFT accuracy trap is domain-general, operating wherever the training signal rewards pattern matching over genuine reasoning.
A third dimension of SFT degradation comes from CoT faithfulness. Does fine-tuning weaken how reasoning steps influence answers? shows that after fine-tuning, the chain-of-thought reasoning steps become less causally connected to the final answer — the model reaches answers through internal shortcuts while the visible reasoning chain becomes increasingly decorative. This means SFT simultaneously degrades three separable properties: reasoning informativeness (InfoGain), epistemic calibration (Does reasoning fine-tuning make models worse at declining to answer?), and reasoning faithfulness — all while improving the accuracy metric that standard benchmarks measure.
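Faithfulness of this kind can be probed with a truncation test, in the spirit of published CoT-faithfulness work: re-ask the question with each prefix of the reasoning chain and check whether the final answer changes. A hedged sketch; `model_answer`, `shortcut_model`, and `faithful_model` are hypothetical stand-ins, not APIs from the cited papers:

```python
def faithfulness_score(model_answer, question, steps):
    """Truncation probe for CoT faithfulness: re-query the model with each
    proper prefix of its reasoning chain and count how often the final
    answer is already fixed. 1.0 means the chain is fully decorative (the
    answer never depends on the steps); 0.0 means every step mattered."""
    full = model_answer(question, steps)
    unchanged = sum(
        model_answer(question, steps[:k]) == full
        for k in range(len(steps))
    )
    return unchanged / len(steps)

# Toy stubs (hypothetical, for illustration only):
shortcut_model = lambda q, s: "B"                          # ignores its reasoning
faithful_model = lambda q, s: "B" if len(s) == 3 else "?"  # needs all steps

steps = ["step 1", "step 2", "step 3"]
decorative = faithfulness_score(shortcut_model, "Which option?", steps)  # 1.0
faithful = faithfulness_score(faithful_model, "Which option?", steps)    # 0.0
```

A rising decorative score after fine-tuning is one concrete way the "increasingly decorative" claim above could be measured.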
This adds a training-time dimension to the existing cluster of overthinking findings. Does extended thinking actually improve reasoning or just increase variance? shows that more inference compute can degrade quality at test time. Does more thinking time always improve reasoning accuracy? shows the same at the token-count level. SFT degrades reasoning quality in a third way: by changing what the training signal rewards, independently of inference behavior. The pattern extends across timescales: Do iterative refinement methods suffer from overthinking? shows the same quality-accuracy trade-off operating through sequential revision cycles, and Does fine-tuning on NLI teach inference or amplify shortcuts? shows fine-tuning amplifying distribution shortcuts rather than genuine skill across a different domain.
Source: Domain Specialization; enriched from Training Fine Tuning
Related concepts in this collection
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say "I don't know"? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  Relation: a different SFT cost, calibration (abstention) vs. reasoning quality (InfoGain); both hidden in accuracy metrics.
- Does extended thinking actually improve reasoning or just increase variance?
  When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
  Relation: parallel inference-time degradation.
- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  Relation: RL addresses the reasoning quality deficit that SFT creates.
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  Relation: task-type specificity of reasoning value.
- Do iterative refinement methods suffer from overthinking?
  Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
  Relation: the same quality-accuracy trade-off at test-time revision rather than training-time fine-tuning.
- Does fine-tuning on NLI teach inference or amplify shortcuts?
  When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.
  Relation: the same pattern, fine-tuning amplifies distribution shortcuts rather than teaching the underlying skill; cross-domain confirmation.
- Can identical outputs hide broken internal representations?
  Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
  Relation: FER provides the representational explanation for the SFT accuracy trap. SFT may produce fractured internal representations that yield correct answers through pattern-matching shortcuts (identical performance) while lacking the unified principles needed for informative reasoning chains.
Original note title: SFT raises domain accuracy but degrades reasoning quality (38 percent InfoGain loss)