Why does external critique improve revision while internal self-assessment fails?

This explores why a model that gets feedback from an outside source revises well, while a model grading its own work tends to make things worse — and what the corpus says is actually doing the work.

The corpus's sharpest answer is that the *act* of revising isn't what helps or hurts; the *source* of the critique is. One study makes this almost surgically clear: revision guided by an external model raises accuracy, but a model revising its own uncertain output usually amplifies confidence in the wrong answer rather than fixing it (see "Does revising your own reasoning actually help or hurt?"). Self-revision in strong reasoning models (QwQ, R1, LIMO) mostly preserves wrong answers, smaller models frequently flip correct answers to incorrect, and longer chains with more revisions correlate with *lower* accuracy (see "Does self-revision actually improve reasoning in language models?").

The mechanism behind the failure has a name: degeneration of thought. When a model reconsiders an answer using its own prior reasoning, it has no independent vantage point. It checks its work against the same flawed prior that produced the error, so it converges toward false confidence instead of away from it (see "Does a model improve by arguing with itself?"). The fix in that same work is telling: replace the single self with *genuinely different* models in debate, and the pattern reverses, with both accuracy and calibration improving. Difference, not introspection, is the active ingredient.

This is why "pure" self-improvement keeps hitting a wall. One synthesis argues that methods which look self-contained almost always smuggle in an external anchor (a past model version, a third-party judge, a user correction, or a tool's output), because unaided self-improvement stalls on the generation-verification gap, diversity collapse, and reward hacking (see "Can models reliably improve themselves without external feedback?"). The deep problem is that a model's ability to *verify* an answer isn't reliably better than its ability to *generate* one, so it has no leverage to correct itself from the inside.
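The "no leverage" point can be made concrete with a small Bayes' rule calculation. This is a toy model of my own, not something taken from the cited notes: it assumes a generator that is right with probability `p_correct` and a verifier that accepts right answers at rate `accept_if_right` and wrong answers at rate `accept_if_wrong`. All three names are hypothetical.

```python
def accuracy_after_self_filter(
    p_correct: float, accept_if_right: float, accept_if_wrong: float
) -> float:
    """Share of verifier-accepted answers that are actually correct.

    Toy assumption: the verifier's accept decision depends only on whether
    the answer is right or wrong, at the two rates given.
    """
    accepted_right = p_correct * accept_if_right
    accepted_wrong = (1.0 - p_correct) * accept_if_wrong
    total_accepted = accepted_right + accepted_wrong
    if total_accepted == 0.0:
        raise ValueError("verifier accepts nothing, so accuracy is undefined")
    # Bayes' rule: P(correct | accepted)
    return accepted_right / total_accepted


# A verifier that cannot tell right from wrong leaves accuracy where it started.
no_gap = accuracy_after_self_filter(0.6, 0.8, 0.8)

# A verifier that discriminates (for example, an independent one) raises it.
with_gap = accuracy_after_self_filter(0.6, 0.9, 0.3)
```

Under these assumptions, filtering helps only when `accept_if_right` exceeds `accept_if_wrong`. A verifier that shares the generator's blind spots accepts its wrong answers about as readily as its right ones, which is one way to read the generation-verification gap.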

What makes external critique different isn't just that it catches errors at test time. Critique signal injected during training counteracts "tail narrowing": it keeps the model's solution space diverse instead of prematurely collapsing onto its favorite answer (see "Do critique models improve diversity during training itself?"). And training a model to *critique* noisy responses produces deeper understanding than training it to imitate correct ones, because critique forces engagement with how things fail rather than copying surface patterns (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). That connects to a quieter finding worth knowing: imitation training captures a confident, fluent *style* without closing any real capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). Self-assessment that flatters its own style is the same trap viewed from inside.

The interesting wrinkle is that internal self-assessment isn't doomed in principle; it's doomed when it has nothing external to ground it. Approaches that get self-judging to work do so by manufacturing an outside-like signal. SERL has the model alternate between generating and *ranking* responses, deriving reward from the consistency between independent judgments rather than from a single self-endorsement (see "Can models learn to judge themselves without external rewards?"). Post-Completion Learning trains self-evaluation in unused sequence space so the model internalizes an evaluation function rather than rubber-stamping its first output (see "Can models learn to evaluate their own work during training?"). The throughline across all of it: revision works when something independent (a different model, a ranking, a held-out judge) breaks the loop of a mind checking itself against itself.
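To illustrate how consistency between judgments can stand in for a single self-endorsement, here is a minimal sketch. It is not SERL's actual algorithm; it only assumes a hypothetical `judge(first, second)` callable that returns `True` when it prefers the response shown first, and it keeps a pairwise verdict only if the judge reaches the same verdict with the presentation order swapped. The function name and the win-rate scoring are my own illustrative choices.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical judge signature: True means the response shown first is preferred.
Judge = Callable[[str, str], bool]


def consistent_win_rates(responses: List[str], judge: Judge) -> Dict[str, float]:
    """Score each response by the share of order-consistent pairings it wins.

    Assumes the responses are distinct strings. A pairing is discarded when
    the judge's preference changes with presentation order.
    """
    wins = {r: 0 for r in responses}
    counted = {r: 0 for r in responses}
    for a, b in combinations(responses, 2):
        prefers_a_when_a_first = judge(a, b)
        prefers_b_when_b_first = judge(b, a)
        if prefers_a_when_a_first == prefers_b_when_b_first:
            # The verdict followed position, not content, so it carries no signal.
            continue
        winner = a if prefers_a_when_a_first else b
        wins[winner] += 1
        counted[a] += 1
        counted[b] += 1
    return {r: (wins[r] / counted[r] if counted[r] else 0.0) for r in responses}


# Toy judge for demonstration only: prefers the longer response.
length_judge: Judge = lambda first, second: len(first) > len(second)
scores = consistent_win_rates(["a", "bb", "ccc"], length_judge)

# A judge that always picks whatever is shown first yields no usable verdicts.
position_biased: Judge = lambda first, second: True
biased_scores = consistent_win_rates(["a", "bb", "ccc"], position_biased)
```

The design choice worth noting is that a single judgment is never trusted on its own. A verdict counts only when a second, differently framed judgment agrees with it, which is a small-scale version of the independence the paragraph above describes.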

Sources (9 notes)

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
