Does policy entropy collapse prevent inference-time search from finding solutions?

This note asks whether the loss of diversity a policy suffers during RL training (entropy collapse) starves inference-time search of the varied candidates it needs to explore, connecting a training-time pathology to a deployment-time capability.

The question bridges two stages that are usually discussed separately: entropy collapse happens during RL *training*, while search happens at *inference*. The corpus suggests the link is real and direct: search can only explore the behaviors the policy is still capable of producing, and entropy collapse quietly shrinks that set before search ever runs.

The foundation is the finding that policy entropy collapse is the primary ceiling on RL-trained reasoning: performance saturates as entropy approaches zero, following a clean empirical law, because the model converges on a few high-reward strategies and stops varying its output (Does policy entropy collapse limit reasoning performance in RL?). The same mechanism shows up in search agents specifically: RL training compresses behavioral diversity, pushing policies onto narrow reward-maximizing trajectories, while supervised fine-tuning on diverse demonstrations preserves exploration breadth (Does reinforcement learning squeeze exploration diversity in search agents?). So the answer to the literal question is: not by blocking the search procedure itself, but by hollowing out the candidate distribution it samples from. A collapsed policy keeps proposing minor variations of the same path, so each increase in search budget yields less.
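To make the "hollowed-out candidate distribution" concrete, here is a minimal sketch. It assumes a toy policy that picks among a fixed set of discrete strategies with known probabilities; the function names and both example distributions are illustrative assumptions, not taken from the cited notes. It computes the Shannon entropy of each policy and the expected number of distinct strategies that appear in N independent samples.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def expected_distinct(probs, n_samples):
    """Expected number of distinct strategies seen in n_samples i.i.d. draws."""
    return sum(1 - (1 - p) ** n_samples for p in probs)

# Hypothetical policies over 10 strategies (illustrative numbers only).
diverse = [0.1] * 10                # uniform: maximum entropy
collapsed = [0.91] + [0.01] * 9     # one dominant strategy: low entropy

print("entropy:", round(entropy(diverse), 3), round(entropy(collapsed), 3))
for n in (1, 4, 16, 64):
    print(n,
          round(expected_distinct(diverse, n), 2),
          round(expected_distinct(collapsed, n), 2))
```

Under these assumed numbers, the collapsed policy surfaces fewer distinct strategies than the uniform one at every budget above a single sample, which is the sense in which search inherits the policy's narrowness.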

The most telling evidence is what happens when diversity is *deliberately* protected at inference time. Mind Evolution runs an island-model genetic search whose entire advantage is that it sustains population diversity, and it beats Best-of-N and sequential revision precisely because those methods suffer 'premature convergence', the inference-time analogue of entropy collapse (Can evolutionary search beat sampling and revision at inference time?). The same logic motivates making latent reasoning stochastic: replacing deterministic updates with sampled transitions, as GRAM does, lets a model hold several candidate strategies at once instead of committing early to one (Can stochastic latent reasoning help models explore multiple solutions?). Both are essentially diversity-preservation techniques aimed at the same failure the entropy-collapse work documents.
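The island idea can be sketched with a generic genetic algorithm. This is not Mind Evolution itself, which uses LLM-generated mutations and crossovers on planning tasks; it is a toy on bitstrings, and every name, parameter, and the count-the-ones objective are assumptions made for illustration. The point is the structure: several populations evolve in isolation, and only occasionally exchange their best member, so no single lineage takes over early.

```python
import random

def fitness(bits):
    """Toy objective: number of 1-bits (stands in for a solution evaluator)."""
    return sum(bits)

def mutate(bits, rng, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]

def crossover(a, b, rng):
    """Single-point crossover of two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def step(population, rng):
    """One generation: keep the better half, refill with mutated children."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[: len(ranked) // 2]
    children = [
        mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
        for _ in range(len(population) - len(parents))
    ]
    return parents + children

def island_search(n_islands=4, pop_size=8, length=20, generations=30,
                  migrate_every=5, seed=0):
    """Evolve isolated islands; periodically copy each island's best to a neighbour."""
    rng = random.Random(seed)
    islands = [
        [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
        for _ in range(n_islands)
    ]
    for gen in range(1, generations + 1):
        islands = [step(pop, rng) for pop in islands]
        if gen % migrate_every == 0:
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                # Replace this island's weakest member with the previous island's best.
                worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
                pop[worst] = list(bests[i - 1])
    return max((ind for pop in islands for ind in pop), key=fitness)

best = island_search()
print(fitness(best), best)
```

Raising `migrate_every` keeps the islands independent for longer, which trades slower sharing of good solutions for more sustained diversity; a single merged population corresponds to the premature-convergence case the paragraph above describes.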

There's a useful caveat in the scaling-law material. Search budget follows a test-time scaling curve much like reasoning tokens, with monotonic but diminishing returns (Do search steps follow the same scaling rules as reasoning tokens?; Does search budget scale like reasoning tokens for answer quality?). Those diminishing returns are what you'd expect if the underlying policy is narrowing: more search keeps surfacing the same answers. And the reminder that training regime matters more than inference compute reinforces the direction of causation: you cannot fully buy back at inference time what was lost during training (Can non-reasoning models catch up with more compute?).

The takeaway: 'inference-time search' and 'policy entropy' are not independent knobs. Search magnifies whatever diversity the policy retains, so the primary lever is upstream. Entropy-management interventions during training (Clip-Cov, KL-Cov, GPPO) and diversity-preserving search structures (islands, stochastic latents) attack the same problem from two ends, and the most robust systems will likely need both.

Sources 7 notes

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
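As a small worked example of the law above, the sketch below evaluates R = -a·exp(H) + b for a few entropy values. The coefficients `a` and `b` are hypothetical values chosen only for illustration (the note gives none), and `a` is assumed positive so that reward rises as entropy falls. At H = 0 the prediction reduces to b - a, which is the ceiling the note refers to.

```python
import math

def predicted_reward(entropy, a, b):
    """Empirical fit R = -a * exp(H) + b, as stated in the note."""
    return -a * math.exp(entropy) + b

# Hypothetical coefficients, for illustration only (assumes a > 0).
a, b = 0.2, 0.9

ceiling = predicted_reward(0.0, a, b)  # equals b - a when H = 0
for h in (1.0, 0.5, 0.1, 0.0):
    r = predicted_reward(h, a, b)
    print(f"H={h:.1f}  R={r:.3f}  headroom={ceiling - r:.3f}")
```

The headroom column shrinks toward zero as entropy is spent, which is why interventions that slow entropy reduction are framed as preserving room to improve.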

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can evolutionary search beat sampling and revision at inference time?

Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

