Does supervised fine-tuning improve reasoning or just answers?
Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
Post angle for Medium / LinkedIn
Hook: "Every AI benchmark measures accuracy. What if accuracy is exactly the wrong thing to measure when deploying AI in high-stakes domains?"
The finding: The Knowledge or Reasoning paper introduces two new metrics — Knowledge Index (KI: the factual correctness of each reasoning step) and Information Gain (InfoGain: how much each reasoning step reduces uncertainty about the final answer). Applying these metrics to SFT-trained models on medical and mathematical tasks, the authors find that SFT raises final-answer accuracy while cutting InfoGain by 38.9%. Models get more answers right while each reasoning step contributes less to getting there.
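To make the InfoGain idea concrete, here is a minimal sketch of one way a step-level measurement could be implemented, assuming uncertainty is approximated by the model's negative log-likelihood of the ground-truth answer given the reasoning prefix. The `model.nll` call and the exact formulation are illustrative assumptions, not the paper's implementation:

```python
import math

def answer_nll(model, question: str, steps: list[str], answer: str) -> float:
    """Negative log-likelihood of the ground-truth answer given the question
    and the reasoning steps seen so far (a proxy for remaining uncertainty).
    `model.nll` is an assumed scoring API, not a real library call."""
    prefix = question + "\n" + "\n".join(steps)
    return model.nll(prompt=prefix, continuation=answer)

def info_gain_per_step(model, question: str, steps: list[str], answer: str) -> list[float]:
    """InfoGain of step i = drop in answer uncertainty after adding step i."""
    gains = []
    prev = answer_nll(model, question, [], answer)
    for i in range(len(steps)):
        cur = answer_nll(model, question, steps[: i + 1], answer)
        gains.append(prev - cur)  # positive = this step moved the model toward the answer
        prev = cur
    return gains
```

Under this reading, a chain whose final answer is right but whose per-step gains hover near zero is exactly the "correct answer, uninformative reasoning" pattern the finding describes.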
The mechanism: SFT rewards answers, not reasoning paths. Training data has question-answer pairs. The loss function anchors on the correct final output. Models learn the most efficient path to the right answer in the training distribution — often domain-specific shortcuts, pattern matches, and frequency-weighted heuristics that produce the correct answer without the inferential chain that would justify it. The reasoning in the output becomes post-hoc rationalization.
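For contrast, a hedged sketch of the standard SFT objective itself: plain next-token cross-entropy over the reference completion. Tensor names and shapes are illustrative, and the logits are assumed to be already aligned to predict the targets; the point is that nothing in the loss distinguishes an inferential step from decorative text, so whatever token sequence most cheaply lands on the correct final answer gets reinforced:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """Standard SFT objective: token-level cross-entropy over the completion.

    logits:     [batch, seq_len, vocab]  model predictions
    target_ids: [batch, seq_len]         reference completion token ids
    loss_mask:  [batch, seq_len]         1 on completion tokens, 0 on the prompt

    Every completion token is scored only for matching the reference surface
    form; no term rewards how much a reasoning token contributes to the answer.
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2),  # cross_entropy expects [batch, vocab, seq_len]
        target_ids,
        reduction="none",
    )
    return (per_token * loss_mask).sum() / loss_mask.sum()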
Why this matters for deployment: High-stakes domains don't just need correct answers — they need auditable reasoning. Medical decision support must show clinical logic. Legal AI must demonstrate how conclusions follow from statute and precedent. Financial AI must show how recommendations connect to market data and regulatory context. SFT improves the answer, but may make the reasoning path less meaningful — more verbose decoration around the correct output than the pathway that produced it.
The measurement problem: Standard benchmarks measure what's easy to measure: whether the final answer matches the ground truth. InfoGain and KI require decomposing reasoning chains and evaluating each step against external ground truth — expensive and difficult to automate at scale. So the measurement gap persists, and every organization that deploys based on benchmark accuracy is systematically blind to the reasoning quality regression.
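A short sketch of where the cost gap comes from, with `verify_step` standing in for whatever external verifier (judge model, retrieval check, or human review) is available; the helper names are hypothetical:

```python
def exact_match_accuracy(predicted: str, gold: str) -> float:
    """What standard benchmarks measure: one cheap comparison per example."""
    return float(predicted.strip() == gold.strip())

def knowledge_index(steps: list[str], verify_step) -> float:
    """KI-style score: fraction of reasoning steps judged factually correct.
    Needs one ground-truth verification per step, so the cost scales with
    chain length on top of example count."""
    if not steps:
        return 0.0
    return sum(bool(verify_step(s)) for s in steps) / len(steps)
```

Accuracy is one string comparison per example; step-level scoring multiplies the verification cost by chain length, which is the gap the note describes.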
The connection: This extends the existing cluster of overthinking findings into the training dimension. "Does extended thinking actually improve reasoning or just increase variance?" covers the inference-time version of the trade-off. "Does reasoning fine-tuning make models worse at declining to answer?" covers a training-time cost to a different capability (calibration). The SFT accuracy trap is the third entry: a training-time cost to reasoning quality.
Platform: Medium (1000–1400 words). Could lead with the FALM / medical AI deployment angle, then introduce the measurement framework.
Source: Domain Specialization
Related concepts in this collection
- Does supervised fine-tuning actually improve reasoning quality?
While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
the underlying insight this post dramatizes
- Does reasoning fine-tuning make models worse at declining to answer?
When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
parallel SFT cost: calibration vs. reasoning quality
- Does extended thinking actually improve reasoning or just increase variance?
When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
inference-time version of the same accuracy vs. quality trade-off
- Does critiquing errors teach deeper understanding than imitating correct answers?
Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
counter-strategy: CFT addresses the SFT accuracy trap by replacing correct-answer imitation with structured failure analysis as the training objective
- Why do better reasoning models ignore instructions?
As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?
a third dimension of SFT/RL training cost: SFT degrades reasoning quality (this note), reasoning training degrades instruction adherence (instruction-following deficit), and both reflect the same pattern — optimizing one capability structurally degrades another
- Can language models solve ToM benchmarks without real reasoning?
Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.
ToM benchmarks are a concrete case of the SFT accuracy trap: SFT achieves competitive ToM scores without reasoning training, suggesting benchmarks reward structural pattern exploitation rather than genuine mental state reasoning
- Why does SFT-then-RL training follow a predictable three-phase pattern?
When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.
temporal dynamics: CHORD reveals the SFT accuracy trap as the first phase of a three-phase progression; RL can recover from SFT's reasoning degradation but only if SFT and RL are integrated as a continuous spectrum rather than hard-sequenced stages
- Does preference optimization harm conversational understanding?
Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
parallel training-induced degradation: SFT degrades reasoning quality (this note) while RLHF degrades conversational grounding; both demonstrate that optimizing for what benchmarks and raters measure structurally erodes capabilities that require different evaluation frameworks
- Can identical outputs hide broken internal representations?
Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
FER provides the representational diagnosis: SFT may produce fractured internal representations that yield correct answers through pattern-matching shortcuts while the underlying structure is broken in ways standard benchmarks cannot detect
- Does fine-tuning weaken how reasoning steps influence answers?
When models are fine-tuned on domain-specific tasks, do their chain-of-thought reasoning steps actually causally drive the final answer, or do they become decorative? This matters because accurate outputs can mask unfaithful reasoning.
a second dimension of SFT damage beyond InfoGain: fine-tuning reduces how much reasoning steps causally influence the final answer, making the chain performative rather than functional; together with InfoGain degradation, SFT damages both reasoning quality and reasoning faithfulness
Original note title: the SFT accuracy trap — training raises benchmark scores while degrading reasoning quality