Does critiquing errors teach deeper understanding than imitating correct answers?
Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
Supervised Fine-Tuning (SFT) trains models to maximize the probability of a correct response given an instruction. Critique Fine-Tuning (CFT) trains models to maximize the probability of a high-quality critique given an instruction plus a noisy (flawed) response. The training objective is P(critique | query, flawed_response). At inference time, the trained model generates direct responses in the normal way; no critique step is invoked.
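A minimal sketch of how the two objectives differ as training examples, assuming a standard causal-LM setup where only the target span is supervised by cross-entropy (the field names and helper functions below are illustrative, not from the source):

```python
# Sketch: the same problem yields different supervised targets under SFT and CFT.
# Only the "target" span contributes to the loss; the "context" span is
# conditioned on but never supervised.

def make_sft_example(query: str, correct_response: str) -> dict:
    # SFT objective: maximize P(correct_response | query)
    return {"context": query, "target": correct_response}

def make_cft_example(query: str, flawed_response: str, critique: str) -> dict:
    # CFT objective: maximize P(critique | query, flawed_response)
    # The flawed response sits in the conditioning context, never in the target,
    # so the model is never trained to reproduce the error, only to analyze it.
    return {
        "context": f"{query}\n\n[Candidate response]\n{flawed_response}",
        "target": critique,
    }
```

Both are plain next-token objectives, so the difference lies entirely in what the supervision forces the model to represent, not in the architecture or the inference procedure.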
The advantage is mechanistic: to write a good critique, the model must understand the problem at a structural level — not just recognize the correct answer pattern but identify precisely what is wrong with a given response and why. This requires engaging with failure modes, understanding the criteria for correctness, and reasoning about deviations from those criteria. SFT can succeed by learning to recognize the surface form of correct answers. CFT cannot succeed by surface matching alone.
The training data is generated efficiently: GPT-4o produces critiques for (query, noisy response) pairs at scale. The cost is that at least 20% of the critiques themselves contain errors, an acknowledged limitation. Yet even imperfect critique supervision outperforms correct-response imitation, which shows how weak the imitation objective is at building understanding.
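A rough sketch of that generation step, assuming the OpenAI Python client; the prompt wording and function name are illustrative rather than the paper's actual pipeline:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITIQUE_PROMPT = (
    "You are given a problem and a candidate response that may contain errors.\n"
    "Write a critique: identify each mistake, explain why it is wrong, and state "
    "what a correct approach would do instead.\n\n"
    "Problem:\n{query}\n\nCandidate response:\n{response}"
)

def generate_critique(query: str, noisy_response: str) -> str:
    # One critique per (query, noisy response) pair; running this over the corpus
    # builds the CFT training set. Some fraction of critiques will themselves be
    # wrong, which is the acknowledged ~20% noise in the supervision.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": CRITIQUE_PROMPT.format(query=query, response=noisy_response),
        }],
    )
    return completion.choices[0].message.content
```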
The key limitation is illuminating: CFT-trained models can critique other models' outputs but do not develop self-critique capability. The training objective creates a competence asymmetry: the model gets better at evaluating others, not at evaluating itself. This is consistent with "Why do models trust their own generated answers?": the structural self-trust bias persists even after extensive critique training on others' outputs.
This connects to "Does chain-of-thought reasoning reveal genuine inference or pattern matching?": both identify the same SFT failure mode. CFT addresses the root cause: instead of training on correct form, train on structured failure analysis.
Source: Reasoning by Reflection
Related concepts in this collection
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  Connection: SFT imitation is the failure; CFT is an alternative training objective that forces structural understanding over form imitation.
- Why do models trust their own generated answers?
  Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.
  Connection: CFT's self-critique limitation confirms that the structural self-trust bias persists even when critique competence is developed for evaluating other models.
- Does supervised fine-tuning improve reasoning or just answers?
  Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
  Connection: CFT is the counter-strategy. Instead of training on the form of correct answers (which raises scores without understanding), it trains on structured failure analysis (which requires understanding).
- Do critique models improve diversity during training itself?
  Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
  Connection: A complementary critique mechanism. AutoMathCritique uses critique to improve training-time exploration diversity, while CFT uses critique-writing as the training signal itself; both treat critique as more than a test-time quality filter.
- Can adversarial training replace task-specific verifiers for reasoning?
  Does an adversarial game between policy and critic provide sufficient reward signal for reasoning tasks when ground-truth verifiers don't exist? This matters because most reasoning domains lack verifiers but have abundant expert demonstrations.
  Connection: A parallel mechanism. RARO's adversarial critic forces genuine reasoning for the same reason CFT's critique objective does: discriminating expert output from policy output requires structural understanding, not surface pattern matching. Both bypass pure imitation.
- Can reasoning emerge from expert demonstrations alone?
  Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.
  Connection: RARO's co-trained critic operationalizes the critique principle via adversarial RL: the critic develops evaluation capability through the same structural-understanding mechanism that makes CFT work, but in a joint training loop rather than as a separate training objective.
- Can reasoning RL work without verifying generated answers?
  Most reasoning RL methods require answer verification, limiting them to math and code. Can models be trained to reason better in domains like medicine and law where verification is impractical?
  Connection: VeriFree extends critique-based training to domains without verifiers. Where CFT trains on structured critiques of flawed responses, VeriFree conditions on the likelihood of the reference answer to create a reward signal without explicit verification; both bypass the deterministic answer checking that limits standard RL to math and code.
Original note title: Training to critique noisy responses produces deeper understanding than training to imitate correct responses.