LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Does supervised fine-tuning actually improve reasoning quality?

While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.

Note · 2026-02-21 · sourced from Domain Specialization
Related questions: How do you build domain expertise into general AI models? · How should we allocate compute budget at inference time? · How should researchers navigate LLM reasoning research?

The "Knowledge or Reasoning" paper introduces a framework that separates what most benchmarks conflate: factual knowledge accuracy and reasoning quality. It defines two metrics: Knowledge Index (KI) — whether each reasoning step invokes factually correct domain knowledge — and Information Gain (InfoGain) — whether each reasoning step actually reduces uncertainty toward the final answer. These are different things, and models can do well on one while failing the other.
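
As a rough sketch of how these metrics could be operationalized (this is not the paper's implementation): InfoGain can be estimated as the per-step change in the log-probability a scoring model assigns to the final answer, and KI as the fraction of steps that pass a domain fact-check. The helpers `logprob` and `is_factually_correct` below are hypothetical placeholders.

```python
from typing import Callable, List

def info_gain(question: str,
              steps: List[str],
              answer: str,
              logprob: Callable[[str, str, str], float]) -> List[float]:
    """Per-step InfoGain: how much each reasoning step raises the
    log-probability of the final answer (i.e., reduces uncertainty)."""
    gains = []
    prev = logprob(question, "", answer)           # uncertainty with no reasoning yet
    for t in range(len(steps)):
        prefix = "\n".join(steps[: t + 1])
        cur = logprob(question, prefix, answer)    # with steps 1..t in context
        gains.append(cur - prev)                   # InfoGain contributed by step t
        prev = cur
    return gains

def knowledge_index(steps: List[str],
                    is_factually_correct: Callable[[str], bool]) -> float:
    """Knowledge Index: fraction of steps whose invoked domain knowledge
    passes a factuality check."""
    if not steps:
        return 0.0
    return sum(is_factually_correct(s) for s in steps) / len(steps)
```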

The finding on SFT is stark. Fine-tuning on domain data raises final-answer accuracy in both mathematical and medical domains. But InfoGain drops by 38.9% on average relative to the base models before fine-tuning. The models are reaching correct answers with less informative reasoning paths — shorter, more direct routes that skip over the inferential steps that actually justify the conclusion.

This is not a paradox once you understand the training signal. SFT rewards answers. The training data contains question-answer pairs, possibly with reasoning chains, but the loss function is ultimately anchored to producing the correct final output. Models learn the most efficient path to the answer in the training distribution — which often means picking up domain-specific shortcuts and pattern matches rather than reasoning through the problem. The "reasoning" in the output becomes a post-hoc rationalization of the answer, not the path by which the answer was reached.
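
A toy contrast makes the difference in training signal concrete. The SFT objective below is plain token-level cross-entropy against the reference completion, reasoning tokens and answer alike; the outcome-style reward only scores the final answer, as outcome-reward RL post-training typically does. Shapes and helper names are illustrative, not taken from either paper.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """SFT signal: token-level imitation of the reference completion.
    Reasoning tokens and answer tokens are weighted the same way, so the
    shortest path that reproduces the reference answer scores as well as
    a fully reasoned one."""
    # logits: [seq_len, vocab_size], target_ids: [seq_len]
    return F.cross_entropy(logits, target_ids)

def outcome_reward(generated_answer: str, reference_answer: str) -> float:
    """Outcome-style reward used in RL post-training: only the final answer
    is scored, leaving the model free to keep whatever reasoning path
    actually gets it there."""
    return 1.0 if generated_answer.strip() == reference_answer.strip() else 0.0
```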

The deployment implication is significant for high-stakes domains. Medical decision support, legal AI, and financial analysis all require not just correct answers but auditable reasoning — the ability to show why the conclusion follows from the evidence. If SFT degrades InfoGain, systems trained this way are less auditable even when they are more accurate. The accuracy improvement is real, but it masks the reasoning regression.

The "SFT Memorizes, RL Generalizes" paper provides cross-domain confirmation in the visual reasoning domain. SFT models display classic memorization symptoms — high sensitivity to input distribution with sharp accuracy drops on out-of-distribution examples — while RL-trained models maintain stable reasoning performance across distribution shifts. The SFT-trained models develop stronger memorization of specific visual patterns rather than learning generalizable reasoning strategies. This extends the KI/InfoGain finding beyond medical and math: the SFT accuracy trap is domain-general, operating wherever the training signal rewards pattern matching over genuine reasoning.

A third dimension of SFT degradation comes from CoT faithfulness. "Does fine-tuning weaken how reasoning steps influence answers?" shows that after fine-tuning, the chain-of-thought reasoning steps become less causally connected to the final answer — the model reaches answers through internal shortcuts while the visible reasoning chain becomes increasingly decorative. This means SFT simultaneously degrades three separable properties: reasoning informativeness (InfoGain), epistemic calibration ("Does reasoning fine-tuning make models worse at declining to answer?"), and reasoning faithfulness — all while improving the accuracy metric that standard benchmarks measure.
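
One common way to probe this kind of faithfulness (a generic sketch, not necessarily the cited note's method) is to truncate the visible chain and check how often the final answer changes; if the answer stays the same no matter how much of the chain is removed, the chain is effectively decorative. `answer_given_cot` is a hypothetical callable that re-queries the model with the question plus a partial chain.

```python
import random
from typing import Callable, List

def cot_dependence(question: str,
                   steps: List[str],
                   answer_given_cot: Callable[[str, List[str]], str],
                   n_trials: int = 20) -> float:
    """Fraction of chain truncations that flip the final answer. Values near
    0.0 mean the visible chain barely influences the answer (decorative);
    higher values mean the answer causally depends on the stated reasoning."""
    baseline = answer_given_cot(question, steps)
    changed = 0
    for _ in range(n_trials):
        cut = random.randint(0, max(len(steps) - 1, 0))  # keep only the first `cut` steps
        if answer_given_cot(question, steps[:cut]) != baseline:
            changed += 1
    return changed / n_trials
```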

This adds a training-time dimension to the existing cluster of overthinking findings. "Does extended thinking actually improve reasoning or just increase variance?" shows that more inference compute can degrade quality at test time. "Does more thinking time always improve reasoning accuracy?" shows the same at the token-count level. SFT degrades reasoning quality in a third way: by changing what the training signal rewards, independently of inference behavior. The pattern extends across timescales: "Do iterative refinement methods suffer from overthinking?" shows the same quality-accuracy trade-off operating through sequential revision cycles, and "Does fine-tuning on NLI teach inference or amplify shortcuts?" shows fine-tuning amplifying dataset shortcuts rather than genuine inferential skill in yet another domain.


Source: Domain Specialization; enriched from Training Fine Tuning

Original note title: SFT raises domain accuracy but degrades reasoning quality by 38.9% InfoGain loss