Do critique models improve diversity during training itself?
Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
The intuitive framing of critique models is that they help at test time: the model generates, the critic scores, we select the best. But the more important finding from AutoMathCritique is that critique integrated into the training loop improves the actor model's exploration efficiency and solution diversity during training itself.
Without critique in the loop, iterative self-training suffers from "tail narrowing": the model converges on a narrow distribution of solutions and becomes less able to explore diverse reasoning paths. The critique model counteracts this. By providing step-level feedback during exploration, it guides the actor toward high-quality paths it would not have discovered alone, maintaining distributional breadth throughout training.
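To make the loop concrete, here is a minimal sketch of one exploration round with a critic in the loop. All interfaces (`actor.generate`, `critic.review`, `actor.refine`, a `feedback.all_steps_ok` flag) are hypothetical stand-ins, not AutoMathCritique's actual API:

```python
def exploration_round(actor, critic, problems, k=8, max_refinements=2):
    """One critique-in-the-loop exploration round (illustrative sketch).

    Plain rejection sampling keeps only solutions the actor already favors,
    which is exactly the tail-narrowing failure mode. The refinement branch
    below lets diverse-but-fixable paths survive into the fine-tuning pool.
    """
    pool = []
    for problem in problems:
        for _ in range(k):
            attempt = actor.generate(problem)
            feedback = critic.review(problem, attempt)  # step-level critique
            for _ in range(max_refinements):
                if feedback.all_steps_ok:
                    break
                # Conditioning on the critique steers the actor toward paths
                # it would not reach by blind resampling alone.
                attempt = actor.refine(problem, attempt, feedback)
                feedback = critic.review(problem, attempt)
            if feedback.all_steps_ok:
                pool.append((problem, attempt))
    return pool  # fine-tune the actor on this pool, then repeat the round
```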
This connects to "Does policy entropy collapse limit reasoning performance in RL?": critique models are a way to maintain entropy, the exploration capacity needed for continued improvement, without relying solely on loss-level entropy interventions such as Clip-Cov and KL-Cov. The critique is an external signal that prevents premature convergence.
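Entropy collapse is also easy to monitor, whatever mechanism is used to prevent it. A standard diagnostic (independent of Clip-Cov and KL-Cov, which are interventions rather than measurements) is the mean per-token entropy of the policy over its own rollouts; a PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor) -> float:
    """Mean entropy of the policy's next-token distributions.

    logits: (batch, seq_len, vocab_size) from the policy on its own rollouts.
    A value that falls steadily across training rounds is the collapse
    symptom that external critique feedback is meant to counteract.
    """
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)  # (batch, seq_len)
    return entropy.mean().item()
```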
The implication: critique models are training infrastructure as much as inference infrastructure. Evaluating them only on test-time accuracy misses their more fundamental role.
Source: Test Time Compute
Related concepts in this collection
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  Relation: critique models as a mechanism against entropy collapse.
- Can natural language feedback overcome numerical reward plateaus?
  Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
  Relation: concrete evidence. Critique-GRPO shows that CoT critiques break plateaus where 8x scaling of numerical rewards fails; the natural-language feedback (NLF) mechanism works precisely because critiques expand the effective exploration space that numerical rewards cannot reach (a refine-under-critique sketch follows this list).
- Can diversity optimization improve quality during language model training?
  Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
  Relation: DARLING provides the complementary mechanism. Critique models maintain diversity by guiding exploration quality, while explicit semantic diversity optimization maintains it by directly rewarding distributional breadth; together they attack entropy collapse from both the feedback channel (critique) and the reward signal (diversity bonus). A toy version of this reward shaping is sketched after the list.
- Can a single problem unlock reasoning through diverse critiques?
  Does exposing models to many different critiques of one problem activate reasoning better than training on many different problems? This matters because it suggests data efficiency isn't the main constraint.
  Relation: extends with extreme efficiency. CFT shows that diverse critiques on a *single* problem suffice for reasoning activation; the diversity-via-critique mechanism does not need a diverse problem distribution, only diverse critiques of the solution space. This is the strongest evidence for the "critique is training infrastructure" framing (see the CFT data sketch after the list).
- Does critiquing errors teach deeper understanding than imitating correct answers?
  Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
  Relation: extends to training data design. Training models on critiques of noisy responses produces deeper understanding than training on correct responses; the principle generalizes from "critique guides exploration" to "critique IS the training signal".
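On the Critique-GRPO item above, the core move is easy to sketch: failed rollouts get a natural-language critique and one critique-conditioned retry before entering the GRPO group, so the group-relative advantage is computed over solutions the scalar reward alone would never surface. All interfaces here (`policy.sample`, `critic.explain_errors`, a 0/1 `reward_fn`) are illustrative assumptions, not the paper's actual API:

```python
def build_group(policy, critic, prompt, reward_fn, k=8):
    """Assemble one GRPO group with critique-conditioned retries.

    Hypothetical interfaces: policy.sample(prompt, hint=None) returns text,
    critic.explain_errors returns a CoT critique string, reward_fn returns
    0.0 or 1.0. This is a sketch of the idea, not Critique-GRPO itself.
    """
    group = []
    for _ in range(k):
        response = policy.sample(prompt)
        r = reward_fn(prompt, response)
        if r == 0.0:
            # Failed attempt: critique it, then retry conditioned on the
            # critique. The retry can land in regions that plain resampling
            # under a scalar reward would never reach.
            critique = critic.explain_errors(prompt, response)
            response = policy.sample(prompt, hint=critique)
            r = reward_fn(prompt, response)
        group.append((response, r))
    # Standard GRPO advantage: standardize rewards within the group.
    rewards = [r for _, r in group]
    mean = sum(rewards) / k
    std = (sum((r - mean) ** 2 for r in rewards) / k) ** 0.5 or 1.0
    return [(resp, (r - mean) / std) for resp, r in group]
```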
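For the DARLING item, a toy version of rewarding diversity alongside quality: each of a prompt's k sampled responses earns a bonus proportional to its mean embedding distance from its siblings. The multiplicative combination and the raw cosine-distance measure are simplifications for illustration; DARLING's actual formulation partitions responses semantically before combining signals:

```python
import numpy as np

def diversity_shaped_rewards(quality: np.ndarray, emb: np.ndarray,
                             alpha: float = 0.5) -> np.ndarray:
    """Toy diversity-aware shaping for one prompt's k responses (k >= 2).

    quality: (k,) base rewards; emb: (k, d) response embeddings.
    A correct-but-novel response outranks a correct-but-redundant one.
    """
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T                 # (k, k) cosine similarities
    k = len(quality)
    # Mean cosine distance to the other k-1 responses (self-sim excluded).
    mean_dist = 1.0 - (sims.sum(axis=1) - 1.0) / (k - 1)
    return quality * (1.0 + alpha * mean_dist)
```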
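And for the two CFT items, the data-design flip is small in code: instead of (question -> correct solution) pairs, train on (question + candidate solution -> critique) pairs, so one problem can carry many diverse training targets. The prompt template and field names below are illustrative, not the paper's exact format:

```python
def make_cft_example(question: str, candidate: str, critique: str) -> dict:
    """One critique-fine-tuning example: the training target is the critique
    of a (possibly flawed) candidate solution, not the solution itself."""
    prompt = (
        f"Question: {question}\n"
        f"Candidate solution: {candidate}\n"
        "Critique:"
    )
    return {"prompt": prompt, "completion": " " + critique}

# One problem, many critiques: the diversity lives in the critique targets,
# not in the problem distribution.
dataset = [
    make_cft_example(
        "Compute 17 * 24.",
        "17 * 24 = 398",
        "The product is wrong: 17*20 = 340 and 17*4 = 68, so 17*24 = 408.",
    ),
    make_cft_example(
        "Compute 17 * 24.",
        "17 * 24 = 408",
        "Correct: 340 + 68 = 408, and the decomposition 20 + 4 is valid.",
    ),
]
```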
Original note title: critique models improve exploration diversity during training not just test-time accuracy