Do critique models improve diversity during training itself?

Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.

Note · 2026-02-20 · sourced from Test Time Compute

Critique models are usually framed as a test-time tool: the model generates, the critic scores, and the best candidate is selected. The more important finding from AutoMathCritique is that critique integrated into the training loop improves the actor model's exploration efficiency and solution diversity during training itself.

Without critique in the loop, iterative self-training suffers from "tail narrowing": the model converges on a narrow distribution of solutions and becomes less able to explore diverse reasoning paths. The critique model counteracts this. By providing step-level feedback on the actor's attempts, it guides the actor toward high-quality paths it would not have discovered alone, which maintains distributional breadth throughout training.
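
The data-collection step of such a loop can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not AutoMathCritique's implementation: the per-problem success probabilities, the `critic_boost` parameter, and the function names are all hypothetical, and the "critic" is reduced to a higher chance that a revised attempt is correct. It shows how a fixed sampling budget leaves hard problems underrepresented among the kept solutions, and how a revision pass changes that mix.

```python
import random

def collect_correct(success_prob, n_samples, seed, use_critic, critic_boost=0.4):
    """Simulate one data-collection round of iterative self-training.

    success_prob: problem id -> probability the actor solves it unaided (hypothetical).
    critic_boost: assumed gain in success probability for a critique-guided revision.
    Returns problem id -> number of correct solutions kept for training.
    """
    rng = random.Random(seed)
    kept = {}
    for problem, p in success_prob.items():
        correct = 0
        for _ in range(n_samples):
            first_ok = rng.random() < p
            # Drawn even when unused so both settings consume the same random stream.
            revised_ok = rng.random() < min(1.0, p + critic_boost)
            if first_ok or (use_critic and revised_ok):
                correct += 1
        kept[problem] = correct
    return kept

def hard_share(kept, hard_problems):
    """Fraction of kept solutions that come from the hard problems."""
    total = sum(kept.values())
    return sum(kept[p] for p in hard_problems) / total if total else 0.0

success_prob = {"easy_1": 0.9, "easy_2": 0.8, "hard_1": 0.1, "hard_2": 0.05}
hard = ["hard_1", "hard_2"]

baseline = collect_correct(success_prob, 200, seed=0, use_critic=False)
with_critic = collect_correct(success_prob, 200, seed=0, use_critic=True)

print("hard-problem share without critic:", round(hard_share(baseline, hard), 3))
print("hard-problem share with critic:   ", round(hard_share(with_critic, hard), 3))
```

Because both runs share a seed and consume the same random stream, every problem keeps at least as many correct solutions with the critic as without it. How much the hard-problem share rises depends entirely on the assumed `critic_boost`.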

This connects to the note "Does policy entropy collapse limit reasoning performance in RL?": critique models are a way to maintain entropy, the exploration capacity needed for continued improvement, without relying solely on entropy management built into the RL objective (Clip-Cov, KL-Cov). The critique is an external signal that prevents premature convergence.
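
One rough way to watch for this kind of collapse is to track the entropy of sampled solutions across training rounds. The sketch below is an illustrative proxy only: policy entropy in the RL sense is computed over token distributions, whereas this function measures Shannon entropy over coarse solution labels. The labels and the `empirical_entropy` helper are hypothetical.

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution over samples."""
    counts = Counter(samples)
    n = len(samples)
    # p * log(1/p) summed over observed categories; written with n / c so a
    # fully collapsed distribution yields 0.0 rather than -0.0.
    return sum((c / n) * math.log(n / c) for c in counts.values())

# Hypothetical strategy labels for solutions sampled at two training rounds.
early_round = ["algebra", "casework", "geometry", "induction"] * 2
late_round = ["algebra"] * 7 + ["casework"]

print("early entropy:", empirical_entropy(early_round))
print("late entropy: ", empirical_entropy(late_round))
```

A falling value across rounds would suggest the sampled solutions are concentrating on fewer strategies, though it says nothing about whether those strategies are correct.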

The implication: critique models are training infrastructure as much as inference infrastructure. Evaluating them only on test-time accuracy misses their more fundamental role.


Source: Test Time Compute

Original note title

critique models improve exploration diversity during training not just test-time accuracy