Can models learn better from critiquing errors than imitating correct responses?

This explores whether training a model to find what's wrong with a flawed answer teaches it more than feeding it correct answers to copy — and why imitation turns out to be a surprisingly weak teacher.

The corpus comes down fairly clearly on the side of critique. The most direct evidence is that training models to critique noisy responses produces deeper understanding than imitation-based training (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). Strikingly, even *imperfect* critique supervision beats correct-answer imitation, because critiquing forces the model to engage with the structure of how reasoning fails rather than just memorizing the shape of a right answer.
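To make the contrast concrete, here is a minimal sketch of how the two kinds of training example might be assembled from the same raw record. The names `Record`, `imitation_example`, and `critique_example`, along with the prompt wording, are illustrative assumptions and not the data format of any particular paper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One raw item; all field names are hypothetical."""
    question: str
    correct_answer: str
    noisy_response: str   # a flawed candidate answer
    critique: str         # feedback on the flawed answer (may itself be imperfect)


def imitation_example(r: Record) -> dict:
    # Imitation: the model sees the question and is trained to copy the correct answer.
    return {"prompt": r.question, "target": r.correct_answer}


def critique_example(r: Record) -> dict:
    # Critique: the model sees the question plus a flawed response
    # and is trained to produce the critique, not the answer itself.
    prompt = (
        f"Question: {r.question}\n"
        f"Candidate response: {r.noisy_response}\n"
        "Critique the candidate response."
    )
    return {"prompt": prompt, "target": r.critique}
```

The only difference between the two builders is what sits on the target side: in the critique setup the model is never rewarded for reproducing an answer, only for explaining what is wrong with one.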

The reason critique wins is clearer once you see how badly imitation underperforms. Models trained to imitate a stronger model (like ChatGPT) pick up its confident, fluent *style* without closing any real capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?"): they fool human evaluators while staying just as wrong on novel tasks. The same surface-pattern trap shows up in argument-quality assessment, where fine-tuning on labeled correct examples teaches models superficial cues instead of the underlying criteria, and explicit instruction in frameworks such as RATIO or QOAM is what improves generalization (see "Can models learn argument quality from labeled examples alone?"). Imitation, in other words, is good at copying what an answer looks like and bad at transmitting why it's right.

But the corpus adds an important wrinkle: *how* you let a model learn from errors matters enormously. Simply showing it static traces of mistakes and corrections fails: supervised fine-tuning on offline self-correction data breaks down because the training errors don't match the errors the model actually makes at test time, so self-correction only works when the model practices correcting its *own* live mistakes under online RL (see "Why does self-correction training on offline data fail?"). The same theme runs through Reflexion, where agents learn from failure by writing verbal self-diagnoses into memory, and the unambiguous success/failure signal is what stops them from rationalizing (see "Can agents learn from failure without updating their weights?"). Error-based learning works when the error is the model's own and the feedback is honest.
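The Reflexion-style loop mentioned here can be sketched in a few lines. This is a toy illustration under stated assumptions: `reflexion_loop` and the three stub functions are hypothetical stand-ins, and in a real agent the actor and the reflection step would be LLM calls while the evaluator would be an environment check.

```python
from typing import Callable, List, Optional, Tuple


def reflexion_loop(
    act: Callable[[List[str]], str],
    evaluate: Callable[[str], bool],
    reflect: Callable[[str], str],
    max_trials: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Retry a task, carrying verbal self-diagnoses between trials (no weight updates)."""
    memory: List[str] = []  # reflections are kept verbatim, not compressed
    for _ in range(max_trials):
        attempt = act(memory)             # the actor conditions on past reflections
        if evaluate(attempt):             # unambiguous binary success/failure signal
            return attempt, memory
        memory.append(reflect(attempt))   # the agent diagnoses its own failed attempt
    return None, memory


# Toy stubs (hypothetical): the actor answers wrongly until it has a reflection to read.
def toy_act(memory: List[str]) -> str:
    return "4" if memory else "5"


def toy_evaluate(attempt: str) -> bool:
    return attempt == "4"


def toy_reflect(attempt: str) -> str:
    return f"Answer {attempt!r} failed the check; recompute before answering."
```

Calling `reflexion_loop(toy_act, toy_evaluate, toy_reflect)` fails once, stores one reflection, and succeeds on the second trial. The point of the design is that the error being diagnosed is always one the agent itself just made.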

There's a catch that complicates the whole picture, though. Models have a built-in bias toward trusting their own outputs: high-probability answers they generated simply *feel* more correct, which sabotages naive self-critique unless you force comparison against outside alternatives (see "Why do models trust their own generated answers?"). On the other hand, critique can be internalized as a trainable skill rather than paid for at inference time: post-completion learning trains a model to compute its own evaluation in the unused space after its answer (see "Can models learn to evaluate their own work during training?"). Taken together, the two suggest critique is most powerful when it's structurally built in, not bolted on.
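One way to picture the post-completion idea is as a sequence layout: the self-evaluation is appended after the answer during training, and generation stops before it at inference. The sketch below is a rough illustration only; the marker names `<answer_end>` and `<eos>`, the whitespace tokenization, and both function names are assumptions, and it does not reproduce how the actual method computes its rewards.

```python
from typing import List, Tuple

END_OF_ANSWER = "<answer_end>"  # hypothetical marker separating answer from self-evaluation
EOS = "<eos>"                   # hypothetical end-of-sequence marker


def build_training_sequence(
    prompt: str, answer: str, self_eval: str
) -> Tuple[List[str], List[int]]:
    """Lay out prompt, answer, marker, self-evaluation, EOS and a matching loss mask."""
    prompt_tokens = prompt.split()  # whitespace split stands in for a real tokenizer
    tokens = (
        prompt_tokens
        + answer.split()
        + [END_OF_ANSWER]
        + self_eval.split()
        + [EOS]
    )
    # Train on everything after the prompt, including the post-answer evaluation.
    loss_mask = [0] * len(prompt_tokens) + [1] * (len(tokens) - len(prompt_tokens))
    return tokens, loss_mask


def truncate_at_inference(generated: List[str]) -> List[str]:
    """Stop at the answer boundary so the evaluation adds no inference cost."""
    if END_OF_ANSWER in generated:
        return generated[: generated.index(END_OF_ANSWER)]
    return generated
```

Under this layout the evaluation tokens contribute to the training loss but are never generated for the user, which is the sense in which critique is built in rather than bolted on.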

The deepest payoff sits a little sideways to the question. One finding shows that deliberately corrupted reasoning traces train models about as well as correct ones (see "Do reasoning traces need to be semantically correct?"). That implies that for imitation, the *content* of the trace barely matters; the trace is just computational scaffolding. That's the quiet thesis tying the corpus together: imitation transmits form, critique transmits understanding. If correct traces can be swapped for corrupted ones and nothing breaks, you were never really learning from their correctness in the first place.

Sources 8 notes

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about error-based vs. imitation-based learning in LLMs. The question remains open: **Do models learn deeper reasoning by critiquing errors than by imitating correct responses?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. A curated library reported:
- Training on critique of noisy responses outperforms imitation of correct answers, even when critique itself is imperfect (2025-01, arXiv:2501.17703).
- Models trained via imitation capture fluent *style* without closing capability gaps; they fool human raters while failing on novel tasks (~2023, arXiv:2305.15717).
- Self-correction only works when models correct their *own* errors under online RL, not from static offline correction traces due to distribution mismatch (2024–2025).
- Models have an inherent bias toward trusting their own high-probability outputs, sabotaging naive self-critique unless forced to compare against alternatives (2024-03, arXiv:2403.09972).
- Deliberately corrupted reasoning traces train models comparably to correct ones, suggesting imitation transmits computational *form*, not semantic understanding (2025-05, arXiv:2505.13775).

Anchor papers (verify; mind their dates):
- arXiv:2501.17703 (2025-01): Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
- arXiv:2305.15717 (2023-05): The False Promise of Imitating Proprietary LLMs
- arXiv:2403.09972 (2024-03): Think Twice Before Trusting: Self-Detection for Large Language Models
- arXiv:2507.20252 (2025-07): Post-Completion Learning for Language Models

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above—especially the claim that critique beats imitation and that corrupted traces ≈ correct traces—determine whether newer scaling, instruction-tuning refinements, novel RL formulations, or multi-agent orchestration have since *relaxed* these constraints or *flipped* their implications. Is the online/offline mismatch still a hard barrier? Does self-detection bias still sabotage critique in the latest models? Separate the durable insight (critique engages reasoning structure vs. form-copying) from the perishable limitation (e.g., "only online RL works").
(2) **Surface the strongest contradicting or superseding work** from the last ~6 months that challenges the critique-over-imitation thesis or reframes when imitation actually *does* learn deep patterns.
(3) **Propose 2 research questions** that assume the learning regime may have shifted—e.g., whether hybrid critique–imitation curricula now outperform either alone, or whether structural critique (built into architecture) has made inference-time critique obsolete.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
