Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does critiquing errors teach deeper understanding than imitating correct answers?

Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.

Note · 2026-02-22 · sourced from Reasoning by Reflection

Supervised Fine-Tuning (SFT) trains models to maximize the probability of a correct response given an instruction. Critique Fine-Tuning (CFT) trains models to maximize the probability of a high-quality critique given an instruction plus a noisy (flawed) response. The training objective is P(critique | query, flawed_response). At inference time, the trained model generates direct responses as usual; no critique step is invoked.
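
A minimal sketch of the two objectives, assuming a Hugging Face-style causal LM; the prompt template and the names `model`, `tok`, `query`, `flawed_response`, and `critique` are placeholders, not the paper's exact setup. Both objectives use the same token-level cross-entropy; only the conditioning and the target differ.

```python
import torch


def lm_loss(model, tok, prompt: str, target: str) -> torch.Tensor:
    """Cross-entropy on `target` tokens only, conditioned on `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask prompt tokens from the loss
    return model(input_ids=full_ids, labels=labels).loss


# SFT objective: maximize P(correct_response | query)
# sft_loss = lm_loss(model, tok, query, correct_response)

# CFT objective: maximize P(critique | query, flawed_response)
# cft_prompt = f"{query}\n\nResponse:\n{flawed_response}\n\nCritique:\n"
# cft_loss = lm_loss(model, tok, cft_prompt, critique)
```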

The advantage is mechanistic: to write a good critique, the model must understand the problem at a structural level — not just recognize the correct answer pattern but identify precisely what is wrong with a given response and why. This requires engaging with failure modes, understanding the criteria for correctness, and reasoning about deviations from those criteria. SFT can succeed by learning to recognize the surface form of correct answers. CFT cannot succeed by surface matching alone.
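
To make the contrast concrete, here is a hypothetical CFT training instance; the worked problem and field names are invented for illustration. The critique target must localize the error and supply the missing reasoning step, which answer-pattern matching alone cannot provide.

```python
# Hypothetical CFT training instance (illustrative, not from the paper's data).
cft_example = {
    "query": "What is 15% of 240?",
    "flawed_response": "15% of 240 is 24, because 240 / 10 = 24.",
    "critique": (
        "The response computes 10% of 240 (240 / 10 = 24), but the question "
        "asks for 15%. The missing step is adding half of the 10% value: "
        "24 + 12 = 36. The correct answer is 36."
    ),
}
```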

The training data is cheap to generate: GPT-4o produces critiques for query/noisy-response pairs at scale. The cost is that at least 20% of the critiques themselves contain errors (an acknowledged limitation). Yet even imperfect critique supervision outperforms correct-response imitation, which reveals how weak the imitation objective is at building understanding.
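
A hedged sketch of that generation step, assuming the OpenAI Python client; the prompt wording and the helper name `generate_critique` are illustrative, not the paper's exact pipeline.

```python
from openai import OpenAI

client = OpenAI()


def generate_critique(query: str, flawed_response: str) -> str:
    """Ask a strong teacher model (here GPT-4o) to critique a flawed response."""
    prompt = (
        f"Question:\n{query}\n\n"
        f"Candidate response:\n{flawed_response}\n\n"
        "Critique the response: identify each error, explain why it is wrong, "
        "and state what a correct solution requires."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Because the teacher's critiques are themselves imperfect, this pipeline trades label quality for scale, which is exactly the ~20% error rate noted above.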

The key limitation is illuminating: CFT-trained models can critique other models' outputs but do not develop self-critique capability. The training objective creates a competence asymmetry — better at evaluating others, not better at evaluating themselves. This is consistent with Why do models trust their own generated answers?: the self-trust structural bias persists even after extensive critique training on others' outputs.

This connects to Does chain-of-thought reasoning reveal genuine inference or pattern matching?: both identify the same SFT failure mode. CFT addresses the root: instead of training on correct form, train on structured failure analysis.


Source: Reasoning by Reflection


Training to critique noisy responses produces deeper understanding than training to imitate correct responses.