INQUIRING LINE

Why does critique training produce deeper understanding than imitation training?

This explores why teaching a model to critique flawed answers builds deeper reasoning than teaching it to copy correct ones — and what 'deeper' actually means once you look at what each method transfers.


The corpus has a sharp answer: imitation mostly transfers surface features, while critique forces engagement with the machinery of reasoning. When you train a model on correct answers, what it picks up is often style and output format, not understanding. Models imitating ChatGPT learn to sound confident and fluent without closing any real capability gap (Can imitating ChatGPT fool evaluators into thinking models improved?), and instruction tuning turns out to teach the *shape* of the output space rather than the task itself: models trained on semantically empty or even wrong instructions perform almost as well as those trained on correct ones (Does instruction tuning teach task understanding or output format?). Imitation is a weak teacher because it offers an easy shortcut: mimic the surface and you get rewarded.

Critique removes that shortcut. To critique a noisy response you have to engage with *why* it fails, meaning the structural reasoning and the failure modes, and that engagement is what builds genuine understanding. Strikingly, even imperfect critique supervision beats correct-answer imitation (Does critiquing errors teach deeper understanding than imitating correct answers?). The signal is strong enough that critique fine-tuning on a *single* problem, using a teacher's critiques of varied solutions, can unlock reasoning comparable to reinforcement learning with verifiable rewards (RLVR) without any reinforcement learning: exposure to correct versus incorrect reasoning on one problem is a sufficient activation signal (Can a single problem unlock reasoning through solution critique?).
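To make the contrast concrete, here is a minimal sketch of how the two kinds of training example differ in shape. The class and function names (`TrainingExample`, `imitation_example`, `critique_example`, `single_problem_critique_set`) and the prompt template are hypothetical, chosen for illustration; they are not taken from any of the cited notes or from a specific library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingExample:
    prompt: str  # text the model conditions on
    target: str  # text the model is trained to produce


def imitation_example(question: str, correct_answer: str) -> TrainingExample:
    """Imitation: the target is simply a correct answer to copy."""
    return TrainingExample(prompt=question, target=correct_answer)


def critique_example(question: str, noisy_response: str, critique: str) -> TrainingExample:
    """Critique: the model sees a possibly flawed response and must explain
    what is right or wrong with it. The prompt wording is an assumption."""
    prompt = (
        f"Question: {question}\n"
        f"Response: {noisy_response}\n"
        "Critique the response."
    )
    return TrainingExample(prompt=prompt, target=critique)


def single_problem_critique_set(
    question: str, solutions: List[str], teacher_critiques: List[str]
) -> List[TrainingExample]:
    """One problem, many candidate solutions, one teacher critique per solution."""
    if len(solutions) != len(teacher_critiques):
        raise ValueError("each solution needs exactly one critique")
    return [
        critique_example(question, solution, critique)
        for solution, critique in zip(solutions, teacher_critiques)
    ]
```

In the imitation case the flawed response never appears, so nothing in the example requires the model to represent why an answer fails. In the critique case the flawed response is part of the input and the explanation of the failure is the target.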

Here's the part you might not expect: the deeper benefit of critique shows up *during* training, not just at test time. Step-level critique in the training loop counteracts 'tail narrowing', the tendency of self-training to collapse onto a few solution patterns, and keeps the model's exploration diverse (Do critique models improve diversity during training itself?). Imitation pushes a model to converge prematurely on what it has already seen; critique keeps the search space open so the model can keep discovering.

This connects to a broader theme in the corpus about what reasoning training actually transfers. Models learn the *logical architecture* of reasoning (how steps sequence and connect) far more than factual content: they tolerate 50% corrupted numbers with only a 3.2% accuracy loss, but lose 13.3% when the steps are shuffled (What do models actually learn from chain-of-thought training?). And training on messy *search processes*, including mistakes and backtracking, yields 25% higher accuracy than training only on clean optimal trajectories, because the model internalizes how to explore rather than a fixed path (Does training on messy search processes improve reasoning?). Critique belongs to this family: engaging with error and structure beats copying the polished final product.
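The two perturbations behind the chain-of-thought finding can be sketched as follows. This is an illustrative reconstruction under assumptions, not the procedure used in the cited work: the function names are hypothetical, each number is corrupted independently with the given probability, and corrupted values are drawn uniformly from 0 to 999.

```python
import random
import re
from typing import List


def corrupt_numbers(steps: List[str], fraction: float = 0.5, seed: int = 0) -> List[str]:
    """Replace each integer in the steps with a random one with probability
    `fraction`. Step order and all non-numeric text are left untouched."""
    rng = random.Random(seed)

    def maybe_corrupt(match: "re.Match[str]") -> str:
        if rng.random() < fraction:
            return str(rng.randint(0, 999))  # assumed corruption range
        return match.group(0)

    return [re.sub(r"\d+", maybe_corrupt, step) for step in steps]


def shuffle_steps(steps: List[str], seed: int = 0) -> List[str]:
    """Keep every step verbatim but randomize their order."""
    rng = random.Random(seed)
    shuffled = list(steps)
    rng.shuffle(shuffled)
    return shuffled


demo = [
    "Step 1: 12 apples plus 8 apples is 20 apples.",
    "Step 2: 20 apples split among 4 people is 5 each.",
    "Step 3: the answer is 5.",
]
facts_broken = corrupt_numbers(demo)   # sequence intact, numbers unreliable
order_broken = shuffle_steps(demo)     # numbers intact, sequence scrambled
```

The first perturbation damages factual content while preserving how the steps connect; the second preserves every fact while destroying the sequence. The reported accuracy gap between them is what supports the claim that structure, not content, is what transfers.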

The corpus doesn't frame this as a winner-take-all contest between critique and imitation. Sequencing matters: establishing reasoning foundations through imitation first, then sharpening against verifiable rewards, beats either alone, because the imitation phase creates reasonable rollouts that later exploration can refine (Does sequencing imitation then exploration training improve reasoning?). Imitation isn't useless; it's a weak teacher of understanding but a fine way to bootstrap a starting point. The deeper lesson: understanding comes from grappling with what's wrong and why, and any training that hands the model only correct answers quietly lets it skip that work.


Sources (8 notes)

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can a single problem unlock reasoning through solution critique?

Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This demonstrates that exposure to correct versus incorrect reasoning on a specific problem is the sufficient activation signal.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

What do models actually learn from chain-of-thought training?

Controlled ablations show models tolerate 50% corrupted numbers (3.2% accuracy loss) but fail under step shuffling (13.3% loss). What distills across reasoning demonstrations is logical architecture—how steps sequence and connect—not factual accuracy.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.
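As a rough illustration of what "exploration and backtracking as serialized strings" can look like, the sketch below runs a depth-first search on a toy subset-sum problem and records every attempt, including dead ends. The problem, the function name `search_with_trace`, and the trace wording are all assumptions made for this example; they are not the task or format used in the cited work.

```python
from typing import List, Optional, Tuple


def search_with_trace(numbers: List[int], target: int) -> Tuple[Optional[List[int]], str]:
    """Find a subset of positive integers summing to `target` by depth-first
    search, returning the solution (or None) and a serialized trace of the
    whole search, dead ends and backtracking included."""
    trace: List[str] = []
    path: List[int] = []

    def dfs(start: int, remaining: int) -> bool:
        if remaining == 0:
            trace.append(f"goal reached with {path}")
            return True
        for i in range(start, len(numbers)):
            n = numbers[i]
            if n > remaining:
                trace.append(f"try {n}: overshoots, skip")
                continue
            path.append(n)
            trace.append(f"try {n}: remaining {remaining - n}")
            if dfs(i + 1, remaining - n):
                return True
            path.pop()  # undo the choice and record the retreat
            trace.append(f"backtrack from {n}")
        return False

    found = dfs(0, target)
    return (list(path) if found else None), "\n".join(trace)


solution, messy_trace = search_with_trace([5, 3, 4], 7)
clean_trajectory = " + ".join(map(str, solution)) + " = 7"
```

Here `clean_trajectory` is the kind of polished target an optimal-trajectory dataset would contain, while `messy_trace` also shows the failed attempt with 5 and the retreat from it. Only the second gives a model examples of how to recover from a wrong turn.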

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
