Does supervised fine-tuning improve reasoning or just response formatting?
This explores whether supervised fine-tuning (SFT) actually teaches a model to reason better, or whether it mostly polishes the surface — getting the final answer right while the inferential work underneath stays shallow or decorative.
The corpus comes down hard on the second answer, with a twist: SFT can raise a benchmark score while quietly making the model reason worse. Two independent measurements find that fine-tuning lifts final-answer accuracy but cuts the informativeness of the actual reasoning steps by about 38.9% (Does supervised fine-tuning improve reasoning or just answers?; Does supervised fine-tuning actually improve reasoning quality?). The model learns to land on correct answers through pattern-matching shortcuts and post-hoc rationalization, not through genuine step-by-step inference. Standard metrics miss this entirely because they only check whether the final answer is right.
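One way to make "informativeness of the reasoning steps" concrete, as a minimal sketch: treat a step's information gain as the drop in surprisal of the correct answer once that step is added to the context. The cited notes do not spell out their Information Gain formula, so this definition, the function names, and the probabilities below are illustrative assumptions, not the studies' method.

```python
import math


def step_information_gain(p_correct_by_step):
    """Per-step information gain in bits (assumed definition).

    p_correct_by_step[i] is the probability the model assigns to the
    correct answer after seeing the first i reasoning steps; index 0 is
    the probability before any reasoning. All values must be > 0.
    """
    gains = []
    for before, after in zip(p_correct_by_step, p_correct_by_step[1:]):
        # Surprisal drop: -log2(before) - (-log2(after)) = log2(after / before)
        gains.append(math.log2(after / before))
    return gains


def mean_information_gain(p_correct_by_step):
    """Average gain per step; 0.0 if the trace has no steps."""
    gains = step_information_gain(p_correct_by_step)
    return sum(gains) / len(gains) if gains else 0.0


# Hypothetical traces that both end fully confident in the right answer.
informative_trace = [0.125, 0.25, 0.5, 1.0]  # each step doubles confidence
decorative_trace = [1.0, 1.0, 1.0, 1.0]      # answer settled before any step

print(mean_information_gain(informative_trace))
print(mean_information_gain(decorative_trace))
```

Under this assumed definition, both traces score identically on final-answer accuracy, yet only the first shows steps that contribute anything. That is the gap a final-answer-only metric cannot see.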
The most striking evidence that the reasoning becomes decorative comes from looking inside the model. One logit-lens study shows that transformers trained with hidden chain-of-thought tokens compute the correct answer in their earliest layers (1-3), then actively suppress that representation in later layers to emit format-compliant filler tokens (Do transformers hide reasoning before producing filler tokens?). The visible 'reasoning' is theater layered on top of a hidden computation. A complementary set of faithfulness tests confirms the disconnect: after fine-tuning, you can cut a reasoning chain short, paraphrase it, or swap in filler, and the final answer often doesn't change (Does fine-tuning disconnect reasoning steps from final answers?). If the steps don't causally drive the answer, they're performance, not process.
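The faithfulness tests described above can be sketched as an answer-invariance check: perturb the reasoning chain, ask for the answer again, and count how often nothing changes. This is a minimal sketch under stated assumptions. `answer_fn` is a hypothetical callable standing in for a real model, the two toy models exist only to make the example runnable, and paraphrasing is omitted because it would need a separate paraphraser.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: (question, reasoning steps) -> final answer.
AnswerFn = Callable[[str, Sequence[str]], str]
Perturbation = Callable[[Sequence[str]], List[str]]


def truncate(steps: Sequence[str], keep_fraction: float = 0.5) -> List[str]:
    """Early termination: keep only the first part of the chain."""
    return list(steps[: int(len(steps) * keep_fraction)])


def fill(steps: Sequence[str], filler: str = "...") -> List[str]:
    """Filler substitution: replace every step with a content-free token."""
    return [filler for _ in steps]


def invariance_rate(
    answer_fn: AnswerFn,
    examples: Sequence[Tuple[str, Sequence[str]]],
    perturb: Perturbation,
) -> float:
    """Fraction of examples whose answer survives the perturbation."""
    if not examples:
        return 0.0
    unchanged = sum(
        answer_fn(question, steps) == answer_fn(question, perturb(steps))
        for question, steps in examples
    )
    return unchanged / len(examples)


def chain_dependent_model(question: str, steps: Sequence[str]) -> str:
    # Toy stand-in: reads its answer off the last reasoning step.
    return steps[-1] if steps else "unknown"


def chain_ignoring_model(question: str, steps: Sequence[str]) -> str:
    # Toy stand-in: answers from the question alone, ignoring the chain.
    return str(len(question))


examples = [
    ("2+2", ["two plus two is two twos", "4"]),
    ("3*3", ["three threes", "9"]),
]

for name, model in [("dependent", chain_dependent_model),
                    ("ignoring", chain_ignoring_model)]:
    print(name, invariance_rate(model, examples, truncate),
          invariance_rate(model, examples, fill))
```

A high invariance rate is the warning sign: it means the answer does not depend on what the chain says, which is the pattern the faithfulness tests report becoming more common after fine-tuning.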
Why does this happen? A clue from elsewhere in the corpus reframes the whole question: base models already contain latent reasoning ability, and post-training mostly *selects* which capability to surface rather than *creating* new capability (Do base models already contain hidden reasoning ability?). From that lens, SFT isn't installing reasoning at all; it's teaching the model which output format to present. That's also why fine-tuning on labeled examples fails to transfer real criteria: models pick up surface patterns instead of principled rules, and only explicit theoretical frameworks teach the actual quality judgments (Can models learn argument quality from labeled examples alone?). Even RL fine-tuning isn't immune: out-of-distribution tests show RL-tuned models still lean on memorized templates rather than learned procedures (Do fine-tuned language models actually learn optimization procedures?).
The interesting part is what the corpus offers as the way out: it points away from plain SFT toward training signals that reward *how* the model gets there, not just *what* it lands on. Reinforcement learning from augmented generation (RLAG) rewards explanation rationality alongside answer accuracy, internalizing coherent knowledge structures that token-level SFT can't (Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). DPO, which trains on explicit correct-vs-incorrect pairs, beats SFT precisely on the rigid format failures where SFT stalls (Can small models match large models on function calling?). And using the model's own answer-span confidence as a reward signal strengthens step-by-step reasoning while repairing calibration (Can model confidence work as a reward signal for reasoning?). The throughline: methods that grade the reasoning trace, not just the final token, are the ones that actually move reasoning rather than formatting. So the unsettling takeaway is that a higher benchmark number after SFT may be a signal that your model is reasoning *less*, and that you'd never know it from the leaderboard.
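To show what "training on explicit correct-vs-incorrect pairs" means in practice, here is a minimal sketch of the standard DPO loss for a single preference pair. The log-probabilities are hypothetical inputs that would come from scoring a chosen and a rejected response under the policy and a frozen reference model. The function names and the `beta` value are illustrative, and the cited note's actual training setup is not described in this document.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative strength of the preference margin
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How far the policy has moved from the reference on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


# Hypothetical log-probabilities for one preference pair.
untrained = dpo_loss(-1.0, -2.0, -1.0, -2.0)  # policy equals reference
improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)   # chosen up, rejected down
print(untrained, improved)
```

Unlike SFT, which only raises the likelihood of the demonstrated output, this objective also pushes probability away from the rejected one. That is the property the note credits for fixing rigid format failures.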
Sources (10 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.