
Does supervised fine-tuning actually improve reasoning on optimization problems?

When SFT boosts benchmark scores on constraint-optimization tasks, does it genuinely improve the model's ability to find feasible solutions, or just its ability to format answers convincingly?

Note · 2026-05-18 · sourced from Reasoning Architectures

The constraint-optimization study runs a controlled comparison between SFT and RL (with constraint-satisfaction rewards) on the same problem class. The SFT result is the diagnostic of interest: SFT clearly improves the form of the answer — JSON structure, decimal places, valid identifiers, expected sections — without improving the feasibility of the answer against the actual physical constraints. The model learns to look like it is solving the problem.

This is the formatting-vs-feasibility gap, and it is a specific instance of a more general SFT failure mode. SFT trains the model to reproduce the surface features of correct demonstrations, but the surface features of a feasible solution and those of a confidently wrong solution are nearly identical. SFT optimizes the loss on the visible tokens, not on whether those tokens encode a valid physical state. The result is fluently presented infeasibility.
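To make the gap concrete, here is a minimal sketch of two answers that pass the same surface check while only one satisfies the constraints. The toy dispatch problem, the `CAPACITY` and `DEMAND` values, and the function names are all hypothetical and are not taken from the study.

```python
import json

# Hypothetical toy problem (not from the study): two generators must meet demand.
CAPACITY = {"gen_a": 50.0, "gen_b": 30.0}  # assumed maximum output per unit
DEMAND = 60.0                               # assumed total output required


def is_well_formed(raw: str) -> bool:
    """Surface check only: valid JSON, expected keys, numeric values."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    dispatch = data.get("dispatch")
    if not isinstance(dispatch, dict) or set(dispatch) != set(CAPACITY):
        return False
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in dispatch.values()
    )


def is_feasible(raw: str, tol: float = 1e-6) -> bool:
    """Substance check: capacity bounds and demand balance must both hold."""
    if not is_well_formed(raw):
        return False
    dispatch = json.loads(raw)["dispatch"]
    within_bounds = all(0.0 <= dispatch[g] <= CAPACITY[g] + tol for g in CAPACITY)
    meets_demand = abs(sum(dispatch.values()) - DEMAND) <= tol
    return within_bounds and meets_demand


# Same structure, same decimal places, same identifiers.
feasible = '{"dispatch": {"gen_a": 40.00, "gen_b": 20.00}}'
infeasible = '{"dispatch": {"gen_a": 20.00, "gen_b": 40.00}}'  # gen_b over capacity
```

A token-level loss sees these two strings as near neighbors; only the explicit constraint check separates them.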

RL with feasibility-targeted rewards moves the needle modestly on actual feasibility, because the reward signal directly penalizes the constraint violations that SFT could not see. This is a real but limited gain — it does not break the 55-60% plateau, but it disambiguates which kind of failure SFT was leaving uncorrected.
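The note does not specify the form of the reward used in the study. As an illustration only, a feasibility-targeted reward might penalize the total amount of constraint violation, as in this sketch for a hypothetical two-generator dispatch problem (the constants and function names are assumptions).

```python
from typing import Dict

# Hypothetical toy problem (not from the study).
CAPACITY: Dict[str, float] = {"gen_a": 50.0, "gen_b": 30.0}  # assumed limits
DEMAND = 60.0                                                 # assumed demand


def constraint_violation(dispatch: Dict[str, float]) -> float:
    """Total violation: capacity overshoot, negative output, demand imbalance."""
    over = sum(max(0.0, dispatch[g] - CAPACITY[g]) for g in CAPACITY)
    under = sum(max(0.0, -dispatch[g]) for g in CAPACITY)
    imbalance = abs(sum(dispatch[g] for g in CAPACITY) - DEMAND)
    return over + under + imbalance


def feasibility_reward(dispatch: Dict[str, float], tol: float = 1e-6) -> float:
    """Assumed reward shape: +1 if feasible, otherwise minus the violation size."""
    violation = constraint_violation(dispatch)
    return 1.0 if violation <= tol else -violation
```

Under this assumed shape, a perfectly formatted answer that overloads `gen_b` still receives a negative reward, which is the signal the token-level SFT loss never provides.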

The methodological implication for fine-tuning practice: when the desired behavior involves correctness in a dimension the loss does not measure, SFT improvements should be treated with suspicion. A clean rise in a benchmark that scores presentation rather than substance may simply mean the model has gotten better at looking right.
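One practical way to act on this is to report presentation and substance as separate metrics rather than one blended score. The sketch below assumes each model output has already been tagged with two booleans by external checkers; the `Sample` type and `summarize` function are hypothetical names, and the flags in the usage lines are made up for illustration, not measured results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    well_formed: bool  # passed the format/schema check
    feasible: bool     # passed the constraint check


def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Report format rate and feasibility rate side by side, plus their gap."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    format_rate = sum(s.well_formed for s in samples) / n
    # An answer only counts as feasible if it could be parsed at all.
    feasibility_rate = sum(s.well_formed and s.feasible for s in samples) / n
    return {
        "format_rate": format_rate,
        "feasibility_rate": feasibility_rate,
        "gap": format_rate - feasibility_rate,
    }


# Illustrative flags only: every answer is well formed, half are feasible.
report = summarize([
    Sample(True, True),
    Sample(True, False),
    Sample(True, True),
    Sample(True, False),
])
```

A large `gap` after fine-tuning would suggest the score gain came from presentation rather than from solving the problem.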

Original note title

SFT improves response formatting but not physical feasibility — formatting wins mask reasoning shortcuts