Does critique training improve exploration diversity during model training or only at test time?
This explores whether training a model to critique its own work helps it keep exploring varied solutions *while it's learning* — or whether the payoff only shows up later as better answers at inference time.
The focus here is *exploration diversity during the training loop itself*, not just final test accuracy. The corpus is unusually direct on this: critique models maintain solution diversity *during* self-training, counteracting the way models otherwise narrow toward a single dominant strategy as they iterate (Do critique models improve diversity during training itself?). The key reframing there is that the training-time benefit, preventing premature convergence, is treated as *more fundamental* than the test-time accuracy bump everyone notices first. So the answer to the literal question is: during training, not only at test time.
What makes this interesting is *why* convergence is the enemy worth fighting. A cluster of notes documents the failure mode critique is pushing against. Outcome-based RL, which rewards only final-answer correctness, sharpens the policy globally, and that diversity loss even bleeds from solved problems onto unsolved ones (Does outcome-based RL diversity loss spread across unsolved problems?). RL training in search agents collapses behavioral diversity through the same entropy-collapse mechanism seen in reasoning (Does reinforcement learning squeeze exploration diversity in search agents?), and RL can quietly converge on a single pretraining format within the first epoch (Does RL training collapse format diversity in pretrained models?). Critique-in-the-loop is one antidote to that tail narrowing; it's not the only one.
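One way to make "diversity collapse" concrete is to track the entropy of the strategies a model uses across sampled solutions. The snippet below is a minimal sketch, not a method taken from the notes: the function name `strategy_entropy` and the strategy labels are hypothetical, and it assumes each sampled solution has already been tagged with a coarse strategy label.

```python
import math
from collections import Counter


def strategy_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution over strategy labels."""
    counts = Counter(labels)
    total = len(labels)
    # Sum p * log2(1/p) over the strategies that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical labels for 8 sampled solutions at two checkpoints (illustrative data only).
early = ["algebra", "geometry", "induction", "casework"] * 2
late = ["algebra"] * 7 + ["geometry"]

print(strategy_entropy(early))  # 2.0 bits: four strategies used equally often
print(strategy_entropy(late))   # well below 1 bit: one strategy dominates
```

A falling value across self-training iterations would be one symptom of the narrowing described above; a flat or slowly falling value would suggest exploration is being preserved.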
The lateral surprise is that critique seems to work not because it tracks diversity directly, but because it forces *engagement with failure*. Training a model to critique noisy, wrong responses produces deeper understanding than imitating correct answers. Even imperfect critique supervision beats correct-answer imitation, because critique makes the model reason about *why* something fails rather than pattern-matching what succeeds (Does critiquing errors teach deeper understanding than imitating correct answers?). And this activation can be astonishingly cheap: critique fine-tuning on a *single problem*, using teacher critiques of varied solutions, activates reasoning at a level comparable to RLVR with no reinforcement learning at all (Can a single problem unlock reasoning through solution critique?). That suggests the mechanism is exposure to the correct-vs-incorrect contrast, not reward-driven search.
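The single-problem setup can be pictured as a data-construction step: one problem, several candidate solutions of varying quality, and one teacher critique per candidate. The following is a rough sketch under stated assumptions. The `CritiqueExample` type, the `build_critique_examples` helper, and the prompt template are all hypothetical and are not drawn from the Critique Fine-Tuning work itself.

```python
from dataclasses import dataclass


@dataclass
class CritiqueExample:
    prompt: str  # what the model sees: the problem plus one candidate solution
    target: str  # what the model is trained to produce: the teacher's critique


def build_critique_examples(problem, solutions, critiques):
    """Pair each candidate solution with its teacher critique (hypothetical format)."""
    if len(solutions) != len(critiques):
        raise ValueError("need exactly one critique per candidate solution")
    return [
        CritiqueExample(
            prompt=(
                f"Problem:\n{problem}\n\n"
                f"Candidate solution:\n{solution}\n\n"
                "Critique this solution."
            ),
            target=critique,
        )
        for solution, critique in zip(solutions, critiques)
    ]


# Illustrative data only: one problem, one correct and one incorrect attempt.
problem = "Solve 2x + 3 = 11."
solutions = ["2x = 8, so x = 4.", "2x = 14, so x = 7."]
critiques = [
    "Correct: 3 is subtracted from both sides before dividing by 2.",
    "Incorrect: 3 was added to 11 instead of subtracted.",
]
examples = build_critique_examples(problem, solutions, critiques)
```

The point of the sketch is that the training target is the critique rather than the answer, so both the correct and the incorrect attempt contribute supervision.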
If you zoom out, critique is one member of a family of "keep exploration alive" techniques, and the corpus shows they're not interchangeable. Directly rewarding semantic diversity during RL catalyzes exploration and *raises* quality rather than trading against it (Can diversity optimization improve quality during language model training?). Training to emit many competent solutions (rather than one) pays off specifically when inference uses search (Should training maximize diversity when models feed into search?), and abstraction-guided breadth-first exploration beats depth alone at large compute budgets (Can abstractions guide exploration better than depth alone?). One note draws the sharpest distinction relevant to your question: training-time diversity (historical exploration, e.g. UCB-style bonuses) and test-time diversity (batch exploration, e.g. repetition penalties) require *structurally different mechanisms* (Does outcome-based RL diversity loss spread across unsolved problems?). Critique lands on the training-time side of that line.
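The historical-versus-batch distinction is easier to see side by side. The sketch below is illustrative only: the function names, the smoothed UCB-style formula, and the linear repetition penalty are assumptions chosen for brevity, not the formulations used in the cited note. What it shows is the structural difference: the first function needs counts accumulated over training, while the second only looks at the samples in one batch.

```python
import math
from collections import Counter


def ucb_bonus(visit_counts, key, total_visits, c=1.0):
    """Historical exploration: the bonus shrinks as `key` is visited more over training."""
    n = visit_counts.get(key, 0)
    # +1 smoothing (an assumption) keeps the bonus finite for unseen keys.
    return c * math.sqrt(math.log(total_visits + 1) / (n + 1))


def batch_repetition_penalty(batch, penalty=0.5):
    """Batch exploration: penalize repeats within one batch; no training history needed."""
    seen = Counter()
    penalties = []
    for answer in batch:
        penalties.append(penalty * seen[answer])  # grows with each earlier duplicate
        seen[answer] += 1
    return penalties


# Hypothetical visit history: answer "A" was reached often in training, "B" rarely.
history = Counter({"A": 50, "B": 2})
total = sum(history.values())

print(ucb_bonus(history, "B", total) > ucb_bonus(history, "A", total))  # True
print(batch_repetition_penalty(["A", "A", "B", "A"]))  # [0.0, 0.5, 0.0, 1.0]
```

The first function depends on state that persists across training steps, whereas the second is stateless between batches, which is one reading of why the two kinds of diversity call for different mechanisms.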
Two caveats worth carrying away. First, whether diversity is even desirable is domain-dependent: preference tuning *reduces* lexical-syntactic diversity in code (where convergence on correctness is good) but *increases* it in creative writing (Does preference tuning always reduce diversity the same way?), and training order mechanically reshapes these entropy dynamics across task types (Does training order reshape how models handle different task types?). Second, the stakes for getting this right are bigger than one model: when diversity collapses across the board, different LLMs independently converge on near-identical outputs, an "Artificial Hivemind" that undermines the supposed benefits of ensembling many models (Do different AI models actually produce diverse outputs?). Critique training, in that light, is partly a defense against everyone's models quietly becoming the same model.
Sources (12 notes)
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.
Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This demonstrates that exposure to correct versus incorrect reasoning on a specific problem is the sufficient activation signal.
DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.
Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.