Why does evaluating errors teach more than imitating correct responses?

This explores why training a model to judge, critique, or work through wrong answers builds deeper capability than simply showing it correct answers to copy — and what the corpus says about the mechanism behind that gap.

The corpus converges on a single reason from several directions: imitation copies surface form, while engaging with errors (critiquing them, filtering them, practicing through them) forces contact with the structure underneath. The cleanest statement comes from work showing that training a model to critique noisy responses produces deeper understanding than training it on correct answers, because critique forces engagement with failure modes and structural reasoning rather than letting the model coast on pattern-matching (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). The flip side is just as sharp: models trained to imitate a stronger system learn its confident, fluent style well enough to fool human evaluators while closing no actual capability gap on novel tasks (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). Imitation, in other words, optimizes for looking right, which is exactly the thing that hides whether you are right.
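The contrast between the two training setups can be made concrete by looking at what the model is asked to produce in each case. The following is a minimal sketch, not the format used in the cited work: the `Example` class, both helper functions, the prompt wording, and the arithmetic example are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str  # what the model sees
    target: str  # what the model is trained to produce


def imitation_example(question: str, correct_answer: str) -> Example:
    # Imitation: the target is the polished answer itself.
    return Example(prompt=f"Question: {question}\nAnswer:", target=correct_answer)


def critique_example(question: str, noisy_response: str, critique: str) -> Example:
    # Critique: the model sees a possibly flawed response and must judge it,
    # so the target has to name what is wrong and why.
    prompt = (
        f"Question: {question}\n"
        f"Candidate response: {noisy_response}\n"
        "Critique the candidate response:"
    )
    return Example(prompt=prompt, target=critique)


# Hypothetical data: the same question under each setup.
imit = imitation_example("What is 17 * 6?", "102")
crit = critique_example(
    "What is 17 * 6?",
    "17 * 6 = 17 * 5 + 6 = 91",
    "Incorrect: 17 * 6 = 17 * 5 + 17 = 102; the last term should be 17, not 6.",
)
```

In the imitation example the target can be reproduced without ever representing how the answer was reached. In the critique example the target is only producible by locating the faulty step, which is the structural contact the paragraph above describes.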

The same lesson recurs wherever researchers tried to teach 'quality' by example and watched it fail. Fine-tuning on labeled good-and-bad arguments doesn't transfer to new argument types: models pick up surface cues instead of principled criteria, whereas explicit instruction in frameworks such as RATIO or QOAM significantly improves how well the judgment generalizes (see "Can models learn argument quality from labeled examples alone?"). Evaluating errors works because it can't be done by surface mimicry: to say *why* something is wrong you have to represent the standard it violates. That's the difference between recognizing a shape and understanding a rule.

More surprising is what the corpus reveals about errors as positive training signal rather than just things to avoid. One result keeps diverse failed trajectories as explicit negative signal while filtering only the successes for quality, letting a 14B model reach frontier math performance; the failures are doing real work (see "Why do correct code trajectories teach models to tolerate errors?"). Training on the full messy search process, including dead ends and backtracking, yields 25% higher accuracy than training on clean optimal solutions alone, because the model learns an internal world model for searching rather than a fixed route to copy (see "Does training on messy search processes improve reasoning?"). And self-correction is not learned from offline correction traces: the errors in those traces don't match the errors the model makes at test time, so the model has to practice on its *own* mistakes via online RL (see "Why does self-correction training on offline data fail?").
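The asymmetric treatment of successes and failures described above can be sketched as a small selection step over a batch of rollouts. This is a simplified illustration under stated assumptions, not the published GRPO-RoC procedure: the `Rollout` fields, the `tool_errors` penalty, the `select_for_training` function, and the sample batch are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    correct: bool     # whether the final answer verified
    tool_errors: int  # hypothetical quality penalty (e.g. failed tool calls)


def select_for_training(rollouts: list[Rollout], max_positives: int) -> list[Rollout]:
    """Keep every failure; keep only the cleanest successes."""
    failures = [r for r in rollouts if not r.correct]
    successes = [r for r in rollouts if r.correct]
    # Rank successes by penalty so messy-but-correct rollouts are dropped first.
    cleanest = sorted(successes, key=lambda r: r.tool_errors)[:max_positives]
    return failures + cleanest


# Hypothetical batch of four rollouts for one problem.
batch = [
    Rollout("clean solve", correct=True, tool_errors=0),
    Rollout("lucky solve after three crashes", correct=True, tool_errors=3),
    Rollout("wrong: sign error", correct=False, tool_errors=0),
    Rollout("wrong: gave up early", correct=False, tool_errors=1),
]
kept = select_for_training(batch, max_positives=1)
```

Here both failures survive as negative signal, while the correct-but-messy rollout is dropped. The design intent, as far as the note summaries describe it, is that rewarding a correct answer also rewards whatever sloppiness led to it, so only the positives need a quality filter.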

There's a deeper reason error-engagement matters that the corpus surfaces almost as a warning: models have a structural bias toward trusting whatever they themselves produced, because high-probability generated answers simply *feel* more correct, and breaking that self-agreement loop requires comparing against alternatives (see "Why do models trust their own generated answers?"). Pure imitation feeds that bias: it rewards producing fluent output and never forces a confrontation with being wrong. Evaluating errors is the antidote precisely because it makes the model take a position against a candidate answer instead of validating it.

The one genuinely destabilizing finding here is that models trained on systematically *irrelevant* reasoning traces maintain solution accuracy, and sometimes generalize better out of distribution, suggesting that in some settings the trace functions as computational scaffolding rather than meaningful content (see "Do reasoning traces need to be semantically correct?"). Hold that next to the critique and search results and a subtler picture emerges: what helps isn't the correctness of the example you imitate but whether the training process forces the model to *do* something structural, whether that's judging, searching, or correcting. Copying correct answers asks for the least structure of all, which is exactly why it teaches the least.

Sources (8 notes)

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps and, with cleaner reasoning, to surpass much larger models.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
