How does critique fine-tuning on one problem unlock broader reasoning?

This explores why training a model to critique solutions to just one problem activates general reasoning — and what that reveals about how reasoning gets 'turned on' versus 'taught.'

The corpus frames this result less as teaching new skills than as flipping a switch on capabilities the model already had. The headline finding is striking: a model can reach reasoning performance comparable to reinforcement learning with verifiable rewards (RLVR) after being trained on teacher-generated critiques of varied solutions to just *one* problem, with no reinforcement learning at all (Can a single problem unlock reasoning through solution critique?). The mechanism isn't accumulating problem coverage; it's exposure to the *contrast* between correct and incorrect reasoning. Once a model engages with why a solution fails, that activation generalizes.
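To make the data setup concrete, here is a minimal sketch of how a single-problem critique dataset might be assembled. Everything named here (`CritiqueExample`, `build_critique_dataset`, `toy_teacher`, the prompt wording) is a hypothetical illustration, not the method's actual pipeline; the only parts taken from the text are the ingredients: one problem, varied candidate solutions, and a teacher-written critique as the training target.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CritiqueExample:
    prompt: str  # what the student model sees: problem plus one candidate solution
    target: str  # what it is trained to produce: the teacher's critique


def build_critique_dataset(
    problem: str,
    candidate_solutions: List[str],
    teacher_critique: Callable[[str, str], str],
) -> List[CritiqueExample]:
    """Pair each candidate solution to ONE problem with a teacher critique."""
    examples = []
    for solution in candidate_solutions:
        prompt = (
            f"Problem:\n{problem}\n\n"
            f"Candidate solution:\n{solution}\n\n"
            "Critique this solution and state whether it is correct."
        )
        examples.append(
            CritiqueExample(prompt=prompt, target=teacher_critique(problem, solution))
        )
    return examples


def toy_teacher(problem: str, solution: str) -> str:
    """Stand-in for a teacher model (assumption): checks only the final value."""
    final = solution.split("=")[-1].strip()
    verdict = "correct" if final == "4" else "incorrect"
    return f"The solution ends with {final}. Verdict: {verdict}."


problem = "Compute 2 + 2."
solutions = ["2 + 2 = 4", "2 + 2 = 5", "2 + 2 = 22"]  # mix of right and wrong
dataset = build_critique_dataset(problem, solutions, toy_teacher)
```

The design point the sketch tries to capture is that variety comes from the solutions, not the problems: every example shares the same problem, and the contrast between correct and incorrect attempts is carried by the critique targets.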

Why would critiquing beat the more obvious route of training on correct answers? Because imitation teaches surface patterns, while critique forces engagement with failure modes and the structure of reasoning itself (Does critiquing errors teach deeper understanding than imitating correct answers?). This is the key to the whole question: the corpus repeatedly shows that learning to produce right answers and learning to *reason* are different things, and standard fine-tuning quietly optimizes the wrong one. Supervised fine-tuning (SFT) raises benchmark accuracy while cutting the informativeness of reasoning steps by 38.9%, so models reach correct answers through post-hoc rationalization rather than genuine inference (Does supervised fine-tuning improve reasoning or just answers?; Does supervised fine-tuning actually improve reasoning quality?). On optimization tasks, SFT makes outputs *look* correct, with proper structure and valid formatting, without making them actually feasible (Does supervised fine-tuning actually improve reasoning on optimization problems?). And fine-tuning can weaken the causal link between a model's stated reasoning steps and its final answer, so the chain-of-thought becomes performative decoration (Does fine-tuning disconnect reasoning steps from final answers?). Against that backdrop, critique works because it is hard to fake by pattern-matching: you have to model the reasoning to judge it.
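The source notes name three faithfulness tests: early termination, paraphrasing, and filler substitution. A rough sketch of the idea behind two of them follows; it is an illustration under stated assumptions, not the studies' actual protocol. The function names, the filler token, and both toy "models" are hypothetical, and paraphrasing is left out because it would need a second model. The quantity computed is how often the answer stays the same when the reasoning is perturbed; a high rate suggests the stated reasoning is not what drives the answer.

```python
from typing import Callable, List

AnswerFn = Callable[[str, List[str]], str]


def truncate(steps: List[str], keep: int) -> List[str]:
    """Early termination: keep only the first `keep` reasoning steps."""
    return steps[:keep]


def fill(steps: List[str], filler: str = "...") -> List[str]:
    """Filler substitution: replace every step with a content-free token."""
    return [filler for _ in steps]


def invariance_rate(question: str, steps: List[str], answer_fn: AnswerFn) -> float:
    """Fraction of perturbed chains that leave the final answer unchanged."""
    baseline = answer_fn(question, steps)
    perturbed = [truncate(steps, k) for k in range(len(steps))]  # keep 0..n-1 steps
    perturbed.append(fill(steps))
    unchanged = sum(answer_fn(question, p) == baseline for p in perturbed)
    return unchanged / len(perturbed)


def faithful_model(question: str, steps: List[str]) -> str:
    """Toy model (assumption): reads its answer off the last reasoning step."""
    return steps[-1].split()[-1] if steps else "unknown"


def unfaithful_model(question: str, steps: List[str]) -> str:
    """Toy model (assumption): ignores the reasoning entirely."""
    return "20"


question = "What is (2 + 3) * 4?"
steps = ["2 + 3 = 5", "5 * 4 = 20"]
faithful_rate = invariance_rate(question, steps, faithful_model)
unfaithful_rate = invariance_rate(question, steps, unfaithful_model)
```

In this toy setup the model that actually uses its steps changes its answer under every perturbation, while the one that ignores them never does, which is the pattern the notes describe as becoming more common after fine-tuning.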

There's a second, subtler payoff that shows up when critique moves into the training loop. Step-level critique during self-training counteracts 'tail narrowing', the tendency of models to prematurely collapse onto a single solution style, and maintains exploration diversity across iterations (Do critique models improve diversity during training itself?). That matters because a recurring failure of reasoning models isn't lack of compute but structural disorganization: they wander into invalid paths and abandon promising ones too early ('underthinking') (Why do reasoning models abandon promising solution paths?). Critique helps keep the solution space open rather than letting the model converge on the first plausible-looking route.

The deeper reason a single problem suffices: the bottleneck for reasoning is often activation, not knowledge. Trace length, for instance, tracks how close a problem sits to the training distribution rather than its actual difficulty, which suggests models are recalling schemas rather than adaptively computing (Does longer reasoning actually mean harder problems?). And more reasoning is not strictly better: accuracy peaks, then declines past a thinking-token threshold as models overthink easy problems (Does more thinking time always improve reasoning accuracy?). If the latent ability is already present and just needs the right organizing signal, then a concentrated dose of correct-vs-incorrect contrast on one problem can plausibly unlock it, which is consistent with the single-problem critique result.

If you want to follow this thread further, the corpus connects it to other 'unlock without retraining' findings: reasoning verbosity turns out to be a single steerable direction in activation space, adjustable with no fine-tuning at all (Can we steer reasoning toward brevity without retraining?), and structured abstractions can enforce breadth-first exploration that depth-only chains miss (Can abstractions guide exploration better than depth alone?). The common thread across all of them: good reasoning is frequently something you *elicit* from a capable model, not something you have to pour in.
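As a rough illustration of what "a single steerable direction" can mean, the sketch below builds a steering vector as the difference between mean activations on concise and verbose examples, then adds a scaled copy to a hidden state. The difference-of-means construction is an assumption chosen for simplicity; the notes say only that one vector is extracted from paired examples. The function names, the two-dimensional toy activations, and `alpha` are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(concise_acts: List[Vector], verbose_acts: List[Vector]) -> Vector:
    """Direction pointing from 'verbose' toward 'concise' activations."""
    concise_mean = mean_vector(concise_acts)
    verbose_mean = mean_vector(verbose_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction; no weights change."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (assumption): axis 0 ~ "verbose", axis 1 ~ "concise".
verbose_acts = [[1.0, 0.0], [3.0, 0.0]]
concise_acts = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_vector(concise_acts, verbose_acts)
steered = steer([2.0, 0.0], direction, alpha=0.5)
```

The point of the sketch is that the intervention happens at inference time on activations, which is why it counts as eliciting behavior rather than training it in.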

Sources 12 notes

Can a single problem unlock reasoning through solution critique?

Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This suggests that exposure to correct versus incorrect reasoning on a specific problem is a sufficient activation signal.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9%, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
