Can models learn better from critiquing errors than imitating correct responses?
This explores whether training a model to find what's wrong with a flawed answer teaches it more than feeding it correct answers to copy — and why imitation turns out to be a surprisingly weak teacher.
The corpus comes down fairly clearly on the side of critique. The most direct evidence is that training models to critique noisy responses produces deeper understanding than imitation-based training (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). Strikingly, even *imperfect* critique supervision beats correct-answer imitation, because critiquing forces the model to engage with the structure of how reasoning fails rather than just memorizing the shape of a right answer.
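To make the contrast concrete, here is a minimal sketch of how the two kinds of training example might be laid out. The names `TrainingExample`, `imitation_example`, and `critique_example`, and the prompt wording, are illustrative assumptions rather than the format used in the underlying work.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    prompt: str   # what the model conditions on
    target: str   # what the model is trained to produce


def imitation_example(question: str, correct_answer: str) -> TrainingExample:
    # Imitation: the target is the correct answer itself.
    return TrainingExample(prompt=question, target=correct_answer)


def critique_example(question: str, noisy_response: str, critique: str) -> TrainingExample:
    # Critique: the model sees a possibly flawed response and is trained
    # to produce the critique. The critique may itself be imperfect.
    # The prompt template below is hypothetical.
    prompt = (
        f"Question: {question}\n"
        f"Candidate response: {noisy_response}\n"
        "Critique this response."
    )
    return TrainingExample(prompt=prompt, target=critique)


if __name__ == "__main__":
    q = "What is 17 * 6?"
    print(imitation_example(q, "102"))
    print(critique_example(q, "17 * 6 = 92", "The product is wrong: 17 * 6 = 102, not 92."))
```

The only structural difference is where the flawed response sits: in the critique setup it is part of the input, so the training target has to say something about why it fails.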
The reason critique wins is clearer once you see how badly imitation underperforms. Models trained to imitate a stronger model (like ChatGPT) pick up its confident, fluent *style* without closing any real capability gap: they fool human evaluators while staying just as wrong on novel tasks (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). The same surface-pattern trap shows up in argument-quality assessment, where fine-tuning on labeled correct examples teaches models superficial cues instead of the underlying criteria, and explicit instruction in theoretical frameworks such as RATIO or QOAM is what improves performance and generalization (see "Can models learn argument quality from labeled examples alone?"). Imitation, in other words, is good at copying what an answer looks like and bad at transmitting why it's right.
But the corpus adds an important wrinkle: *how* you let a model learn from errors matters enormously. Simply showing it static traces of mistakes and corrections fails. Supervised training on offline correction data breaks down because the training errors don't match the errors the model actually makes at test time, and the model collapses into a single correction mode; self-correction only works when the model practices correcting its *own* live mistakes under online RL (see "Why does self-correction training on offline data fail?"). The same theme runs through Reflexion, where agents learn from failure by writing verbal self-diagnoses into memory, and the unambiguous success/failure signal is what stops them from rationalizing (see "Can agents learn from failure without updating their weights?"). Error-based learning works when the error is the model's own and the feedback is honest.
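The Reflexion-style loop described above can be sketched in a few lines. This is a toy under stated assumptions, not the Reflexion implementation: `choose_action` stands in for an LLM policy that reads its own reflections, `succeeds` stands in for the environment, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Reflection:
    action: str  # the attempt that failed
    note: str    # verbal self-diagnosis, stored uncompressed


def choose_action(actions: List[str], memory: List[Reflection]) -> Optional[str]:
    # Stand-in for an LLM policy that reads its reflections:
    # pick the first action that no reflection has ruled out.
    ruled_out = {r.action for r in memory}
    for action in actions:
        if action not in ruled_out:
            return action
    return None


def run_episodes(
    actions: List[str],
    succeeds: Callable[[str], bool],
    max_episodes: int = 10,
) -> Tuple[Optional[str], List[Reflection]]:
    memory: List[Reflection] = []  # the only state that changes between episodes
    for _ in range(max_episodes):
        action = choose_action(actions, memory)
        if action is None:
            break  # every option has been ruled out
        if succeeds(action):  # unambiguous binary signal from the environment
            return action, memory
        memory.append(Reflection(action, f"'{action}' failed; do not repeat it."))
    return None, memory


if __name__ == "__main__":
    result, memory = run_episodes(["a", "b", "c"], lambda a: a == "c")
    print(result, [r.note for r in memory])
```

Nothing resembling a weight is updated here; improvement across episodes comes entirely from the growing `memory` list, and the reflections are trustworthy only because `succeeds` returns a clear pass or fail.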
There's a catch that complicates the whole picture, though. Models have a built-in bias toward trusting their own outputs: high-probability answers they generated simply *feel* more correct, which sabotages naive self-critique unless you force comparison against outside alternatives (see "Why do models trust their own generated answers?"). And critique can be internalized as a trainable skill rather than an inference-time cost: post-completion learning trains a model to compute its own evaluation in the unused space after its answer (see "Can models learn to evaluate their own work during training?"). Taken together, the two suggest critique is most powerful when it's structurally built in, not bolted on.
The deepest payoff sits a little sideways to the question. One finding shows that deliberately corrupted reasoning traces train models about as well as correct ones (see "Do reasoning traces need to be semantically correct?"). That implies that for imitation, the *content* of the reasoning being copied barely matters; the trace is just computational scaffolding. That's the quiet thesis tying the corpus together: imitation transmits form, critique transmits understanding. If correct reasoning can be swapped for corrupted reasoning and nothing breaks, you were never really learning from its correctness in the first place.
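As a rough illustration of what "corrupted" can mean here, the sketch below pairs each question with the reasoning trace of a different question while leaving the final answer untouched. The source note only says the traces were systematically irrelevant; the cyclic-shift scheme and the names `TracedExample` and `with_irrelevant_traces` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TracedExample:
    question: str
    trace: str    # intermediate reasoning shown before the answer
    answer: str


def with_irrelevant_traces(examples: List[TracedExample]) -> List[TracedExample]:
    # Give each example the trace of the next example (cyclically),
    # keeping its own question and final answer.
    n = len(examples)
    if n < 2:
        return list(examples)  # nothing to swap with
    return [
        TracedExample(ex.question, examples[(i + 1) % n].trace, ex.answer)
        for i, ex in enumerate(examples)
    ]


if __name__ == "__main__":
    data = [
        TracedExample("q1", "t1", "a1"),
        TracedExample("q2", "t2", "a2"),
        TracedExample("q3", "t3", "a3"),
    ]
    for ex in with_irrelevant_traces(data):
        print(ex)
```

If a model trained on the shuffled set matches one trained on the original, the trace's semantic link to the question was not what the model was learning from.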
Sources (8 notes)
Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.