Does supervised fine-tuning actually improve reasoning on optimization problems?
When SFT boosts benchmark scores on constraint-optimization tasks, does it genuinely improve the model's ability to find feasible solutions, or just its ability to format answers convincingly?
The constraint-optimization study runs a controlled comparison between SFT and RL (with constraint-satisfaction rewards) on the same problem class. The SFT result is the diagnostic of interest: SFT clearly improves the form of the answer — JSON structure, decimal places, valid identifiers, expected sections — without improving the feasibility of the answer against the actual physical constraints. The model learns to look like it is solving the problem.
This is the formatting-vs-feasibility gap, and it is a specific instance of a more general SFT failure mode. SFT trains the model to reproduce the surface features of correct demonstrations, and the surface features of a feasible solution and of a confidently wrong solution are nearly identical. SFT optimizes the loss on the visible tokens, not on whether those tokens encode a valid physical state. The result is fluently presented infeasibility.
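To make the gap concrete, here is a minimal sketch that separates a format check from a feasibility check. The toy dispatch problem, the names `DEMAND` and `LIMITS`, and both answers are hypothetical and are not taken from the study; the point is only that two answers can be indistinguishable to the first check while differing on the second.

```python
import json

# Hypothetical toy problem (not from the study): two generators must
# jointly meet a demand of 100 MW, each limited to the range [0, 60] MW.
DEMAND = 100.0
LIMITS = {"g1": (0.0, 60.0), "g2": (0.0, 60.0)}


def is_well_formed(answer: str) -> bool:
    """Format check: valid JSON object with exactly the expected numeric keys."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != set(LIMITS):
        return False
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in data.values()
    )


def is_feasible(answer: str, tol: float = 1e-6) -> bool:
    """Feasibility check: every bound respected and supply equal to demand."""
    if not is_well_formed(answer):
        return False
    data = json.loads(answer)
    within_bounds = all(
        LIMITS[k][0] - tol <= data[k] <= LIMITS[k][1] + tol for k in LIMITS
    )
    balanced = abs(sum(data.values()) - DEMAND) <= tol
    return within_bounds and balanced


feasible_answer = '{"g1": 50.00, "g2": 50.00}'
confident_wrong = '{"g1": 75.00, "g2": 25.00}'  # same shape, g1 exceeds its limit
```

Both strings pass `is_well_formed`, but only `feasible_answer` passes `is_feasible`. A token-level loss sees something closer to the first check than to the second.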
RL with feasibility-targeted rewards modestly improves actual feasibility, because the reward signal directly penalizes the constraint violations that SFT could not see. The gain is real but limited: it does not break the 55-60% plateau, but it does identify which kind of failure SFT was leaving uncorrected.
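The note does not specify the study's reward function. As an assumption, one common shape for a feasibility-targeted reward is a penalty proportional to total constraint violation. The sketch below uses hypothetical names (`constraint_violation`, `feasibility_reward`) and a hypothetical two-generator problem.

```python
from typing import Dict, Tuple

Bounds = Dict[str, Tuple[float, float]]


def constraint_violation(dispatch: Dict[str, float], limits: Bounds, demand: float) -> float:
    """Total violation: distance outside each bound plus supply/demand mismatch."""
    bound_violation = sum(
        max(limits[k][0] - x, 0.0) + max(x - limits[k][1], 0.0)
        for k, x in dispatch.items()
    )
    balance_violation = abs(sum(dispatch.values()) - demand)
    return bound_violation + balance_violation


def feasibility_reward(
    dispatch: Dict[str, float], limits: Bounds, demand: float, tol: float = 1e-6
) -> float:
    """Assumed reward shape: +1 for a feasible point, else minus the violation."""
    violation = constraint_violation(dispatch, limits, demand)
    return 1.0 if violation <= tol else -violation


# Hypothetical problem data, for illustration only.
limits = {"g1": (0.0, 60.0), "g2": (0.0, 60.0)}
reward_ok = feasibility_reward({"g1": 50.0, "g2": 50.0}, limits, demand=100.0)
reward_bad = feasibility_reward({"g1": 75.0, "g2": 25.0}, limits, demand=100.0)
```

Unlike a token-level loss, this signal depends only on the decoded values, so a well-formatted but infeasible answer is penalized in proportion to how far it violates the constraints.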
The methodological implication for fine-tuning practice: when the desired behavior involves correctness in a dimension the loss does not measure, SFT improvements should be treated with suspicion. If the benchmark scores presentation rather than substance, a clean rise in benchmark score can simply mean the model has gotten better at looking right.
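One practical way to act on this is to report presentation and substance as separate rates rather than as one score. The sketch below is a hypothetical helper, and the flags in `example` are illustrative rather than data from the study.

```python
from typing import Dict, Iterable, Tuple


def score_breakdown(results: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarize (well_formed, feasible) flags as two rates and their gap."""
    rows = list(results)
    if not rows:
        raise ValueError("no results to score")
    n = len(rows)
    format_rate = sum(1 for well_formed, _ in rows if well_formed) / n
    # An answer only counts as feasible if it could be parsed in the first place.
    feasibility_rate = sum(1 for well_formed, feasible in rows if well_formed and feasible) / n
    return {
        "format_rate": format_rate,
        "feasibility_rate": feasibility_rate,
        "gap": format_rate - feasibility_rate,
    }


# Illustrative flags only: three well-formed answers, one of them feasible.
example = [(True, True), (True, False), (True, False), (False, False)]
report = score_breakdown(example)
```

A large `gap` after fine-tuning would suggest that the score gain came mostly from formatting.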
Related concepts in this collection
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  same paper, the underlying shortcut SFT reinforces
- Do fine-tuned language models actually learn optimization procedures?
  Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.
  same paper, the OOD diagnostic
- Do larger language models solve constrained optimization better?
  Explores whether scaling LLMs—through more parameters, better training, or reasoning extensions—improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.
  same paper, the parent ceiling
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  adjacent: form-over-content in CoT training
Original note title
SFT improves response formatting but not physical feasibility — formatting wins mask reasoning shortcuts