Does supervised fine-tuning improve accuracy while damaging the quality of reasoning?

This explores whether supervised fine-tuning (SFT) buys higher benchmark scores at the cost of how a model actually reasons — and what the corpus says is really happening underneath the accuracy gain.

The corpus says yes, and the trade is sharper than benchmark scores reveal. Two converging measurements find that SFT raises final-answer accuracy while cutting the *informativeness* of reasoning steps by roughly 38.9%: models start reaching correct answers through post-hoc rationalization and pattern-matching shortcuts rather than genuine inferential steps (see "Does supervised fine-tuning improve reasoning or just answers?" and "Does supervised fine-tuning actually improve reasoning quality?"). The unsettling part is that standard metrics can't see this, because they only check whether the final answer is right.

What does "degraded reasoning" actually mean mechanically? A third strand makes it concrete: after fine-tuning, the reasoning chain stops *driving* the answer. When you truncate the chain early, paraphrase it, or swap in filler text, the final answer often stays the same, meaning the steps have become performative decoration rather than functional computation (see "Does fine-tuning disconnect reasoning steps from final answers?"). You see the same surface-over-substance pattern in narrower domains: on optimization problems, SFT teaches models to produce outputs that *look* correct (valid JSON, proper sections) without making the underlying solutions physically feasible (see "Does supervised fine-tuning actually improve reasoning on optimization problems?"). And on argument quality, fine-tuning on labeled examples picks up surface patterns instead of principled criteria, so it fails to generalize to new argument types (see "Can models learn argument quality from labeled examples alone?").
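The three perturbation tests described above (early termination, paraphrasing, filler substitution) can be sketched as a small harness. This is a minimal illustration, not the procedure used in the cited note: `answer_fn` and `paraphrase_fn` are hypothetical callables standing in for a model call and a paraphraser, and exact string equality is an assumed criterion for "the answer stayed the same".

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: (question, reasoning steps) -> final answer string.
AnswerFn = Callable[[str, List[str]], str]


def truncate(steps: List[str], keep_fraction: float = 0.5) -> List[str]:
    """Early termination: keep only the first part of the chain."""
    if not steps:
        return []
    keep = max(1, int(len(steps) * keep_fraction))
    return steps[:keep]


def filler(steps: List[str], token: str = "...") -> List[str]:
    """Filler substitution: same number of steps, no content."""
    return [token for _ in steps]


def invariance_rates(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, List[str]]],
    paraphrase_fn: Callable[[str], str],
) -> Dict[str, float]:
    """Fraction of examples whose answer is unchanged under each perturbation.

    A higher rate means the chain influences the answer less.
    """
    perturbations = {
        "early_termination": truncate,
        "paraphrase": lambda steps: [paraphrase_fn(s) for s in steps],
        "filler": filler,
    }
    unchanged = {name: 0 for name in perturbations}
    for question, steps in examples:
        baseline = answer_fn(question, steps)
        for name, perturb in perturbations.items():
            if answer_fn(question, perturb(steps)) == baseline:
                unchanged[name] += 1
    n = len(examples)
    return {name: count / n for name, count in unchanged.items()} if n else {}
```

Comparing these rates for the same model before and after fine-tuning is one way to put a number on "performative rather than functional"; note that a high paraphrase-invariance rate on its own is expected even for a faithful model, so the truncation and filler rates carry most of the signal.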

Here's the twist that makes the whole picture stranger: maybe reasoning traces were never doing the work we assumed. Models trained on *deliberately corrupted*, irrelevant traces match the accuracy of models trained on correct ones, and sometimes generalize better out-of-distribution, which suggests the trace functions as computational scaffolding rather than meaningful thought (see "Do reasoning traces need to be semantically correct?"). If true, SFT isn't destroying reasoning so much as exposing that token-level imitation optimizes the wrong thing. That reframes the problem from "don't damage reasoning" to "the reasoning capability was already in the base model, and post-training mostly selects it rather than creates it" (see "Do base models already contain hidden reasoning ability?" and "Does RL post-training create reasoning or just deploy it?").

So what works better? The corpus points toward training that rewards *reasoning quality*, not just correct tokens. RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, internalizing coherent knowledge structures that plain SFT misses (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). RLSF uses the model's own answer confidence as a reward, strengthening step-by-step reasoning while repairing the calibration that RLHF erodes (see "Can model confidence work as a reward signal for reasoning?"). But this isn't a clean win for RL either: RL-fine-tuned models still collapse on out-of-distribution variants, suggesting they often sharpen memorized templates rather than installing real procedures (see "Do fine-tuned language models actually learn optimization procedures?").
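To make the confidence-as-reward idea concrete, here is a rough sketch of turning answer-span confidence into synthetic preference pairs. It is an illustration under stated assumptions rather than the RLSF recipe itself: the `Trace` type, the geometric-mean definition of confidence, and the `min_gap` threshold are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Trace:
    text: str                     # full reasoning trace plus final answer
    answer_logprobs: List[float]  # log-probs of the tokens in the answer span


def answer_confidence(trace: Trace) -> float:
    """Assumed definition: geometric mean of answer-token probabilities."""
    if not trace.answer_logprobs:
        return 0.0
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def preference_pairs(traces: List[Trace], min_gap: float = 0.05) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) text pairs, ranked by answer confidence.

    Pairs whose confidence gap is below `min_gap` are skipped so that
    near-ties do not become noisy preferences.
    """
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    pairs = []
    # combinations() preserves list order, so `better` never ranks below `worse`.
    for better, worse in combinations(ranked, 2):
        if answer_confidence(better) - answer_confidence(worse) >= min_gap:
            pairs.append((better.text, worse.text))
    return pairs
```

The resulting pairs could feed any preference-based trainer. No human labels or external verifier appear anywhere in the loop, which is the property the note highlights.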

The thing you didn't know you wanted to know: "accuracy" and "reasoning" are measuring different objects, and most benchmarks only watch one of them. The faithfulness work and the corrupted-trace work together imply that a model can get more right while understanding less, and that the chain-of-thought you're reading may be a story told after the answer was already decided, not the path that reached it. If you want a doorway into measuring the gap directly, the Information Gain framing is the sharpest tool here (see "Does supervised fine-tuning improve reasoning or just answers?").
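As a rough illustration of what a per-step information measure could look like, the sketch below scores each reasoning step by how much it raises the log-probability of the correct answer. This is an assumed definition chosen for the example; the cited note does not spell out its exact formula here, and the function names are hypothetical.

```python
import math
from typing import List


def step_information_gain(p_correct: List[float]) -> List[float]:
    """Per-step gain in bits.

    p_correct[0] is the probability assigned to the correct answer before
    any reasoning; p_correct[t] is the probability after step t.
    All probabilities must be strictly positive.
    """
    return [
        math.log2(after) - math.log2(before)
        for before, after in zip(p_correct, p_correct[1:])
    ]


def mean_gain(p_correct: List[float]) -> float:
    """Average information gain per reasoning step (0.0 for an empty chain)."""
    gains = step_information_gain(p_correct)
    return sum(gains) / len(gains) if gains else 0.0


def relative_change(before: float, after: float) -> float:
    """Relative change between two mean gains, e.g. base model vs fine-tuned."""
    return (after - before) / before
```

For a chain with probabilities `[0.25, 0.5, 0.5, 1.0]`, the per-step gains are `[1.0, 0.0, 1.0]` bits: the middle step adds nothing. Two models can end at the same final probability, and so the same accuracy, while distributing that gain very differently across steps, which is the gap a final-answer metric cannot see.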

Sources (11 notes)

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
