Does RL training narrow search diversity the same way it does reasoning?
Exploring whether the entropy collapse pattern observed in reasoning RL also appears in search agent training. Understanding this helps identify whether diversity loss is a general RL property or domain-specific.
The "RL Squeezes, SFT Expands" paper studies search agents trained with RL versus SFT and finds the same pattern that the reasoning literature documented: RL training compresses the diversity of behaviors the agent explores (squeezes), while SFT on diverse demonstrations expands it. Since Does policy entropy collapse limit reasoning performance in RL?, and since this paper shows the same dynamic in search RL, entropy collapse is not a quirk of reasoning training — it is a property of RL training at large.
The mechanism is the same in both domains: RL reinforces outputs that earn high reward and suppresses those that do not. Over training, the policy concentrates probability mass on the reward-maximizing region of its action space. In reasoning, this means converging on a narrow set of reasoning patterns; in search, on a narrow set of query strategies. Both reduce the agent's ability to explore novel approaches to hard problems.
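To make the concentration dynamic concrete, here is a minimal sketch (my illustration, not code from the paper): REINFORCE on a toy 10-armed bandit, where the arms stand in for discrete query strategies and the per-arm rewards are hypothetical. Policy entropy starts near log(10) ≈ 2.3 nats and falls toward zero as the policy locks onto the highest-reward arm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 10                               # stand-in for discrete query strategies
logits = np.zeros(n_actions)                 # softmax policy parameters
rewards = rng.uniform(0.0, 1.0, n_actions)   # fixed, hypothetical per-strategy rewards
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

for step in range(2001):
    probs = softmax(logits)
    action = rng.choice(n_actions, p=probs)
    # REINFORCE: push up the log-prob of the sampled action, scaled by its reward
    grad_logp = -probs.copy()
    grad_logp[action] += 1.0
    logits += lr * rewards[action] * grad_logp
    if step % 500 == 0:
        print(f"step {step:4d}  policy entropy {entropy(probs):.3f} nats")
```

With all rewards positive and no entropy term, every update only sharpens the distribution; this is the squeeze in miniature.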
SFT has the opposite effect because it trains on human demonstrations or diverse synthetic completions — the diversity of the training set is preserved in the policy. The tradeoff is that SFT cannot generalize beyond its demonstrations in the same way RL can.
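The reason SFT preserves diversity has a one-line statistical form: minimizing cross-entropy over a demonstration set drives the policy toward the empirical distribution of those demonstrations, so whatever spread the demos have survives in the policy. A hedged sketch, using categorical strategy labels as a stand-in for full search trajectories:

```python
from collections import Counter

# Hypothetical demonstration set of search strategies (not data from the paper)
demos = ["broad_query", "site_filter", "broad_query", "follow_links",
         "broad_query", "site_filter", "rephrase", "follow_links"]

# The cross-entropy minimizer over a categorical policy is exactly the
# empirical frequency of each strategy, so demo diversity is preserved.
counts = Counter(demos)
sft_policy = {s: c / len(demos) for s, c in counts.items()}
print(sft_policy)  # broad_query 0.375, site_filter 0.25, follow_links 0.25, rephrase 0.125
```

RL on the same strategies would instead shift mass toward the single highest-reward label.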
This finding has practical implications for DR agent design: RL-trained search agents need explicit diversity mechanisms (entropy regularization, diverse reward models, periodic SFT refreshes) or they will converge on query templates that work well on average but fail under distribution shift. The remedy from Do critique models improve diversity during training itself? applies here as well: external critique prevents the RL agent from collapsing to a narrow search strategy.
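Of the remedies listed above, entropy regularization is the simplest to sketch: add an entropy bonus beta * H(pi) to the objective so the optimizer pays a price for collapsing the distribution. A minimal sketch in the same toy setup as the bandit above; beta is a hypothetical coefficient, not a value from the paper:

```python
import numpy as np

def entropy_regularized_step(logits, action, reward, beta=0.05, lr=0.5):
    """One REINFORCE step on the objective E[reward] + beta * H(pi)."""
    z = np.exp(logits - logits.max())
    probs = z / z.sum()
    # grad of log pi(action) with respect to the logits
    grad_logp = -probs.copy()
    grad_logp[action] += 1.0
    # grad of the entropy H(pi) with respect to logit k: -p_k * (log p_k + H)
    H = -np.sum(probs * np.log(probs + 1e-12))
    grad_H = -probs * (np.log(probs + 1e-12) + H)
    return logits + lr * (reward * grad_logp + beta * grad_H)
```

With beta = 0 this reduces to the plain update that collapses; raising beta trades some average reward for retained strategy diversity, which is exactly the tradeoff these mechanisms manage.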
Source: Deep Research
Related concepts in this collection
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  extends: entropy collapse is confirmed in the search domain; the bottleneck is a general property of RL training, not reasoning-specific
- Do critique models improve diversity during training itself?
  Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
  applies: the diversity-preservation remedy generalizes to search RL; critique models prevent search strategy collapse
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  parallel RL emergence pattern: domain reasoning capabilities (AlphaMed) and search capabilities both emerge from RL reward signals; entropy collapse constrains scaling in both
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and Return-Conditioned RL show similar performance on reasoning tasks. The question is whether algorithm differences are fundamentally irrelevant, or whether something deeper explains the convergence.
  algorithm-invariance evidence in reasoning and entropy collapse in search are the same mechanism from different angles: both show RL is bounded by the pretrained prior, not by optimizer choice
- Does RL training collapse format diversity in pretrained models?
  Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
  the format-level selection mechanism: RL entropy collapse in search narrows strategy diversity within one distribution, while the echo chamber effect selects which pretraining distribution survives; format selection precedes and compounds within-format diversity loss
Original note title
rl training for search agents squeezes exploration diversity while sft expands it — the same entropy collapse dynamic operates in search as in reasoning