Can entropy regularization or critique models prevent search strategy collapse during RL training?
This explores whether two specific interventions — entropy regularization and critique/feedback models — can stop an RL-trained search agent from narrowing onto a single rigid strategy, and the corpus suggests they attack two different parts of the same problem.
This explores whether you can keep an RL-trained search agent from collapsing onto one narrow strategy by either (a) actively managing the policy's entropy or (b) feeding it richer critique signals. The corpus has material on both and treats them as complementary rather than competing fixes.
First, the collapse is real and not unique to search. RL training squeezes exploration diversity in search agents through the *same* mechanism documented in reasoning: policies converge on whatever maximizes reward and abandon the rest (Does reinforcement learning squeeze exploration diversity in search agents?). That convergence has a measurable signature: performance saturates as policy entropy approaches zero, following the empirical law R = -a·exp(H) + b, so the ceiling can be roughly predicted from the entropy curve (Does policy entropy collapse limit reasoning performance in RL?). And the thing being collapsed onto isn't necessarily the *best* strategy: controlled experiments show RL amplifies a single dominant format inherited from pretraining within the first epoch, with the winner often determined by model scale rather than performance (Does RL training collapse format diversity in pretrained models?).
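To make the "predict the ceiling from the entropy curve" claim concrete, here is a minimal sketch that fits the empirical law R = -a·exp(H) + b by ordinary least squares on exp(H). The function names and the choice of a plain linear fit are assumptions for illustration, not the method used in the cited work.

```python
import math

def fit_entropy_law(entropies, rewards):
    """Least-squares fit of R = -a * exp(H) + b. Returns (a, b).

    The law is linear in x = exp(H), so a simple linear regression
    of R on x recovers slope = -a and intercept = b.
    """
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rewards) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rewards))
    slope = sxr / sxx
    intercept = mean_r - slope * mean_x
    return -slope, intercept

def predicted_ceiling(a, b):
    """As H approaches 0, exp(H) approaches 1, so R approaches b - a."""
    return b - a

# Hypothetical logged (entropy, reward) pairs from a training run.
logged_entropy = [1.0, 0.8, 0.5, 0.2, 0.1]
logged_reward = [-0.2 * math.exp(h) + 0.9 for h in logged_entropy]

a, b = fit_entropy_law(logged_entropy, logged_reward)
ceiling = predicted_ceiling(a, b)
```

If the fitted curve holds, `ceiling` is the performance the run is heading toward once entropy is exhausted, which is why interventions that slow entropy loss matter.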
On the entropy-regularization side, the answer is a qualified yes. The named interventions (Clip-Cov, KL-Cov, and GPPO) work by managing *how* entropy is reduced during training rather than letting it crater, preserving exploratory capacity and pushing back the performance ceiling (Does policy entropy collapse limit reasoning performance in RL?). But the corpus surfaces a catch: even without any explicit regularizer, RL updates only 5–30% of parameters, and those sparse updates are nearly identical across random seeds (Does reinforcement learning update only a small fraction of parameters?). That structural narrowing suggests entropy management is fighting a strong built-in pull toward concentration: regularization slows the collapse, but it doesn't reverse the underlying tendency.
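The source does not spell out how these interventions pick their targets, so the following is only a rough sketch under a stated assumption: that covariance-style methods such as Clip-Cov and KL-Cov act on the small set of tokens whose log-probability and advantage co-vary most strongly, since those tokens are presumed to drive entropy down fastest. The function names, the `fraction` parameter, and the top-k selection rule are all hypothetical.

```python
def token_covariances(logprobs, advantages):
    """Per-token centered product (logprob - mean) * (advantage - mean)."""
    n = len(logprobs)
    mean_lp = sum(logprobs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (adv - mean_adv)
            for lp, adv in zip(logprobs, advantages)]

def high_covariance_mask(logprobs, advantages, fraction=0.2):
    """Flag the top `fraction` of tokens by covariance contribution.

    Assumption: flagged tokens would then be treated specially,
    e.g. excluded from the gradient or given an extra KL penalty.
    That downstream step is not implemented here.
    """
    covs = token_covariances(logprobs, advantages)
    k = max(1, int(len(covs) * fraction))
    ranked = sorted(range(len(covs)), key=lambda i: covs[i], reverse=True)
    flagged = set(ranked[:k])
    return [i in flagged for i in range(len(covs))]

# Hypothetical batch of five tokens.
logprobs = [-0.1, -2.0, -0.5, -3.0, -0.2]
advantages = [1.0, -1.0, 0.2, -0.5, 0.9]
mask = high_covariance_mask(logprobs, advantages, fraction=0.4)
```

The design point is that only a small slice of tokens is touched, so the bulk of the update proceeds normally while the fastest entropy-reducing tokens are held back.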
The critique-model angle is the more interesting lateral move, because it changes *what information* the policy gets rather than just how widely it samples. The core diagnosis: numerical rewards are informationally thin. They tell the model it failed but not why or how to improve. Critique-GRPO shows that models stuck on a performance plateau start producing correct solutions once given chain-of-thought critiques instead of bare scalars (Can natural language feedback overcome numerical reward plateaus?). Tree-search critics do something adjacent: AlphaLLM's three critic models derive dense, process-level quality signals that rank solution *paths*, which is exactly the granularity a search agent needs to know that a strategy is dead-ending before it commits (Can tree search replace human feedback in LLM training?). A leaner variant, DRO, reuses a single cross-rollout statistic both as a token-level reward weight and as a query-level filter, throwing out degenerate comparisons and buying 2–3× faster, more stable training (Can one statistical measure serve dual purposes in RL training?).
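The query-filtering idea can be illustrated with a small sketch. It assumes the statistic is the variance of scalar rewards across rollouts for the same query, which is a stand-in: the source only says a single self-supervised statistic is reused at two levels. The names `filter_degenerate_queries`, `group_advantages`, and `min_variance` are hypothetical.

```python
from statistics import mean, pvariance

def filter_degenerate_queries(rollout_rewards, min_variance=1e-6):
    """Keep only queries whose rollouts disagree enough to be informative.

    If every rollout for a query gets the same reward, there is nothing
    to compare, so the query contributes no learning signal.
    """
    return {query: rewards
            for query, rewards in rollout_rewards.items()
            if pvariance(rewards) > min_variance}

def group_advantages(rewards):
    """Group-normalized advantages: (reward - group mean) / group std.

    Only call this on groups that passed the variance filter, otherwise
    the standard deviation is zero.
    """
    mu = mean(rewards)
    sd = pvariance(rewards) ** 0.5
    return [(r - mu) / sd for r in rewards]

# Hypothetical rewards for four rollouts per query.
rollout_rewards = {
    "q1": [1.0, 1.0, 1.0, 1.0],  # all succeed: degenerate
    "q2": [0.0, 1.0, 0.0, 1.0],  # mixed: informative
    "q3": [0.0, 0.0, 0.0, 0.0],  # all fail: degenerate
}
kept = filter_degenerate_queries(rollout_rewards)
advantages = {q: group_advantages(r) for q, r in kept.items()}
```

Note what the scalar pipeline still cannot do: for `q3` it knows every attempt failed but has no information about why, which is the gap the critique-based methods above are meant to fill.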
The synthesis worth carrying away: entropy regularization and critique models prevent collapse at different layers. Entropy methods keep the policy *sampling broadly* (a width problem); critique models keep it *learning the right thing from each sample* (a signal-quality problem). The two-phase view of RL training hints at why you might want both: early training is driven by execution correctness, but the later bottleneck is strategic exploration, where planning-token entropy actually needs to *rise* (Does RL training follow a predictable two-phase learning sequence?). The unexpected coda is that the cleanest fix might sit upstream of either: SFT on diverse demonstrations preserves exploration breadth that RL then erodes (Does reinforcement learning squeeze exploration diversity in search agents?), implying you prevent collapse partly by what you bank *before* RL begins, not only by what you regularize during it.
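One way to watch for the two-phase pattern is to track entropy separately for planning and execution tokens. The sketch below assumes you already have per-token probability distributions and a boolean label marking which tokens count as planning. How tokens get that label is not specified in the source, so the `is_planning` input is a hypothetical placeholder.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy_by_role(token_dists, is_planning):
    """Return (mean planning entropy, mean execution entropy)."""
    planning = [token_entropy(d)
                for d, flag in zip(token_dists, is_planning) if flag]
    execution = [token_entropy(d)
                 for d, flag in zip(token_dists, is_planning) if not flag]
    plan_mean = sum(planning) / len(planning) if planning else 0.0
    exec_mean = sum(execution) / len(execution) if execution else 0.0
    return plan_mean, exec_mean

# Hypothetical distributions for four generated tokens.
token_dists = [
    [0.5, 0.5],                # planning token, uncertain
    [1.0, 0.0],                # execution token, fully confident
    [0.25, 0.25, 0.25, 0.25],  # planning token, very uncertain
    [0.9, 0.1],                # execution token, mostly confident
]
is_planning = [True, False, True, False]
plan_h, exec_h = mean_entropy_by_role(token_dists, is_planning)
```

Logged over training, a healthy run under the two-phase view would show the execution curve flattening while the planning curve stays up or rises; both falling together would be a sign of the collapse discussed here.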
Sources (8 notes)
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.