
Do fine-tuned language models actually learn optimization procedures?

Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.

Note · 2026-05-18 · sourced from Reasoning Architectures

The constraint-optimization study uses a clean diagnostic to separate procedure from pattern: an N-case test set (in-distribution power-grid topologies) and an N-1 test set (the same problems with one element removed, which puts them out of distribution while keeping the structure recognizable). A model that executes the actual optimization procedure should perform comparably on both. A model that relies on pattern matching should perform worse on N-1.
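As a rough illustration of how an N-1 set can be derived from an N case, the sketch below enumerates every variant of a network with exactly one element removed. It assumes the topology is represented as a list of edges and that the removed "element" is a line; both are assumptions made here for illustration, and the study's actual representation and choice of removed elements may differ. The names `n_minus_1_variants` and `base_case` are hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[str, str]


def n_minus_1_variants(edges: List[Edge]) -> List[List[Edge]]:
    """Return every variant of the network with exactly one edge removed."""
    return [edges[:i] + edges[i + 1:] for i in range(len(edges))]


# Hypothetical 4-bus ring, used only for illustration.
base_case: List[Edge] = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
variants = n_minus_1_variants(base_case)

for removed, variant in zip(base_case, variants):
    print(f"removed {removed}: {variant}")
```

Each variant keeps the rest of the topology intact, which is what makes the perturbed problem recognizable while still differing from anything in the N-case set.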

Even under GRPO and constraint-satisfaction-reward training, models degrade markedly on N-1. The conclusion is that RL on outcome-based rewards does not install the missing procedure; it sharpens the template-matching strategy along the in-distribution axis. The model gets better at recognizing patterns it has seen and, relative to that gain, worse at adapting to perturbed structure.
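One way to see why an outcome-based reward carries no direct signal about the procedure is to look at how GRPO-style training scores a group of sampled responses. The sketch below assumes the group-relative normalization commonly described for GRPO (reward minus group mean, divided by group standard deviation) and a hypothetical binary constraint-satisfaction reward. The study's actual reward design is not specified in this note, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one problem:
# 1.0 = all constraints satisfied, 0.0 = at least one violated.
rewards = [1.0, 1.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

The first two responses receive identical advantages because only the outcome is scored. Whether one of them ran the procedure and the other matched a familiar template is invisible to the update.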

This is methodologically important because it provides a probe that other reasoning evaluations lack. Most benchmarks cannot distinguish "the model solved this" from "the model recognized this." The N / N-1 comparison forces the distinction by holding the problem class fixed while perturbing the instance. The performance drop from N to N-1 is the memorization signature.
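The comparison can be sketched as a simple gap between two accuracies. This is a minimal sketch, assuming exact-match scoring and a model exposed as a plain function from problem text to answer text; the names `accuracy`, `structure_shift_gap`, and the toy `lookup_model` are hypothetical and stand in for whatever metric and model a real evaluation uses.

```python
from typing import Callable, Dict, Sequence, Tuple

Case = Tuple[str, str]  # (problem, expected answer)


def accuracy(model: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Fraction of cases where the model's answer matches the expected one."""
    if not cases:
        raise ValueError("empty test set")
    correct = sum(1 for problem, expected in cases if model(problem) == expected)
    return correct / len(cases)


def structure_shift_gap(
    model: Callable[[str], str],
    n_cases: Sequence[Case],
    n_minus_1_cases: Sequence[Case],
) -> float:
    """Accuracy on the N set minus accuracy on the N-1 set."""
    return accuracy(model, n_cases) - accuracy(model, n_minus_1_cases)


# Toy pure memorizer: answers only problems it has seen verbatim.
memorized: Dict[str, str] = {"grid-1": "x", "grid-2": "y"}


def lookup_model(problem: str) -> str:
    return memorized.get(problem, "unknown")


n_cases = [("grid-1", "x"), ("grid-2", "y")]
n_minus_1_cases = [("grid-1 minus one line", "z"), ("grid-2 minus one line", "w")]

gap = structure_shift_gap(lookup_model, n_cases, n_minus_1_cases)
print(gap)
```

A gap near zero is what a model running the procedure would be expected to show; a large positive gap, as with the toy memorizer here, is the signature the note describes.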

For practitioners, the diagnostic generalizes. Wherever a deployment cares whether a model is computing or recalling — clinical reasoning, legal-statute reasoning, scientific problem-solving — building an "N-1" counterpart of the canonical test set is a cheap way to surface memorization. The structure-shift probe is more informative than headline accuracy on the canonical set.

Original note title

N-1 out-of-distribution tests reveal that RL fine-tuned LLMs still rely on memorization for optimization problems