Do fine-tuned language models actually learn optimization procedures?
Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.
The constraint-optimization study uses a clean diagnostic to separate procedure from pattern: an N-case test set (in-distribution power-grid topologies) and an N-1 test set (the same problems with one element removed, which puts them out of distribution while keeping the structure recognizable). A model that runs the actual procedure should perform comparably on both sets. A model that matches patterns should perform worse on N-1.
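As a rough illustration of how an N-1 set could be derived from an N-case topology, the sketch below removes one line at a time from a toy grid. It rests on assumptions that are not from the study: the grid is a plain list of bus names and undirected lines, the removed "element" is always a line, and variants that disconnect the network are discarded. The names `is_connected` and `n_minus_1_variants` are hypothetical.

```python
from collections import defaultdict


def is_connected(nodes, edges):
    """Return True if every node is reachable from the first node."""
    if not nodes:
        return True
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == set(nodes)


def n_minus_1_variants(nodes, edges):
    """Return every edge list obtained by removing exactly one edge.

    Assumption for this sketch: variants that split the network are
    dropped, so each returned topology is still a single connected grid.
    """
    variants = []
    for i in range(len(edges)):
        reduced = edges[:i] + edges[i + 1:]
        if is_connected(nodes, reduced):
            variants.append(reduced)
    return variants


# Toy 4-bus ring with one chord (made-up example data).
buses = ["A", "B", "C", "D"]
lines = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")]
variants = n_minus_1_variants(buses, lines)
```

Each entry in `variants` has the same buses as the original case but one line fewer, which is the sense in which the problem class stays fixed while the instance is perturbed.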
Even under GRPO and constraint-satisfaction-reward training, models degrade markedly on N-1. The conclusion is that RL on outcome-based rewards does not install the missing procedure; instead, it sharpens the template-matching strategy along the in-distribution axis. The model gets better at recognizing patterns it has seen and relatively worse at adapting to perturbed structure.
This is methodologically important because it provides a probe that other reasoning evaluations lack. Most benchmarks cannot distinguish "the model solved this" from "the model recognized this." The N / N-1 comparison forces the distinction by holding the problem class fixed while perturbing the instance. The drop is the memorization signature.
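The comparison can be reduced to a single number per model. The sketch below assumes each test set has been scored as a list of per-instance pass/fail booleans; that representation, the function names, and the choice to report both an absolute and a relative drop are assumptions for illustration, not the study's own metric.

```python
def accuracy(results):
    """Fraction of instances solved; `results` is a list of booleans."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(results) / len(results)


def structure_shift_gap(n_results, n_minus_1_results):
    """Compare accuracy on the N set with accuracy on the N-1 set.

    Returns both accuracies, the absolute drop (N minus N-1), and the
    drop relative to N-set accuracy (0.0 when N-set accuracy is zero).
    """
    acc_n = accuracy(n_results)
    acc_n1 = accuracy(n_minus_1_results)
    absolute_drop = acc_n - acc_n1
    relative_drop = absolute_drop / acc_n if acc_n > 0 else 0.0
    return {
        "acc_n": acc_n,
        "acc_n_minus_1": acc_n1,
        "absolute_drop": absolute_drop,
        "relative_drop": relative_drop,
    }


# Made-up pass/fail outcomes, not results from the study.
n_outcomes = [True] * 8 + [False] * 2
n_minus_1_outcomes = [True] * 4 + [False] * 6
gap = structure_shift_gap(n_outcomes, n_minus_1_outcomes)
```

A drop near zero is what a model running the procedure would be expected to show; a large positive drop is the pattern the note describes as the memorization signature.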
For practitioners, the diagnostic generalizes. Wherever a deployment cares whether a model is computing or recalling — clinical reasoning, legal-statute reasoning, scientific problem-solving — building an "N-1" counterpart of the canonical test set is a cheap way to surface memorization. The structure-shift probe is more informative than headline accuracy on the canonical set.
Related concepts in this collection
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  (same paper, the mechanism this diagnostic exposes)
- Do larger language models solve constrained optimization better?
  Explores whether scaling LLMs (through more parameters, better training, or reasoning extensions) improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.
  (same paper, the parent finding)
- Does supervised fine-tuning actually improve reasoning on optimization problems?
  When SFT boosts benchmark scores on constraint-optimization tasks, does it genuinely improve the model's ability to find feasible solutions, or just its ability to format answers convincingly?
  (same paper, complementary memorization signature)
Original note title
N-1 out-of-distribution tests reveal that RL fine-tuned LLMs still rely on memorization for optimization problems