Reinforcement Learning for LLMs

Does RL training narrow search diversity the same way it does reasoning?

Exploring whether the entropy collapse pattern observed in reasoning RL also appears in search agent training. Understanding this helps identify whether diversity loss is a general RL property or domain-specific.

Note · 2026-02-21 · sourced from Deep Research

The "RL Squeezes, SFT Expands" paper studies search agents trained with RL versus SFT and finds the same pattern the reasoning literature documented: RL training compresses the diversity of behaviors the agent explores (squeezes), while SFT on diverse demonstrations expands it. Given the entropy collapse documented in Does policy entropy collapse limit reasoning performance in RL?, and given that this paper shows the same dynamic in search RL, entropy collapse is not a quirk of reasoning training; it is a property of RL training at large.

The mechanism is the same in both domains: RL rewards the policy for high-reward outputs and penalizes low-reward ones. Over training, the policy concentrates probability mass on the reward-maximizing region of its action space. In reasoning, this means converging on a narrow set of reasoning patterns. In search, it means converging on a narrow set of query strategies. Both reduce the agent's ability to explore novel approaches to hard problems.
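The concentration dynamic can be seen in a toy setup. A minimal sketch, with all strategy counts, rewards, and hyperparameters hypothetical: a softmax policy over five discrete "query strategies" trained with REINFORCE on a deterministic-reward bandit. As probability mass collapses onto the best-paying strategy, policy entropy falls.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical bandit: five query strategies, strategy 0 pays best.
rewards = [1.0, 0.6, 0.5, 0.4, 0.3]
logits = [0.0] * len(rewards)
lr = 0.5
random.seed(0)

h_start = entropy(softmax(logits))
for _ in range(500):
    probs = softmax(logits)
    a = random.choices(range(len(probs)), weights=probs)[0]
    baseline = sum(p * r for p, r in zip(probs, rewards))
    adv = rewards[a] - baseline
    # REINFORCE on softmax logits: d log pi(a) / d logit_i = 1[i == a] - pi(i)
    for i in range(len(logits)):
        logits[i] += lr * adv * ((1.0 if i == a else 0.0) - probs[i])

h_end = entropy(softmax(logits))
print(f"entropy {h_start:.2f} -> {h_end:.2f}")  # mass concentrates; entropy falls
```

The expected drift of each logit under this update is `lr * p_i * (r_i - baseline)`, so any strategy paying above the policy's current average gains mass and everything else loses it, which is exactly the squeeze the paper describes.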

SFT has the opposite effect because it trains on human demonstrations or diverse synthetic completions — the diversity of the training set is preserved in the policy. The tradeoff is that SFT cannot generalize beyond its demonstrations in the same way RL can.
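The diversity-preservation claim has a simple mathematical core: cross-entropy training (the SFT objective) drives a softmax policy toward the empirical distribution of the demonstrations, so the demo set's entropy carries over into the policy. A minimal sketch, with hypothetical strategy names and frequencies:

```python
import math
from collections import Counter

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical demo set: four distinct search strategies, mixed frequencies.
demos = ["broad"] * 4 + ["narrow"] * 3 + ["pivot"] * 2 + ["backtrack"] * 1
actions = sorted(set(demos))
freqs = {a: c / len(demos) for a, c in Counter(demos).items()}

logits = {a: 0.0 for a in actions}
lr = 0.5
for _ in range(2000):
    probs = dict(zip(actions, softmax([logits[a] for a in actions])))
    for a in actions:
        # Gradient of mean log-likelihood w.r.t. logit_a is freq(a) - p(a),
        # so the fixed point is exactly the empirical demo distribution.
        logits[a] += lr * (freqs[a] - probs[a])

probs = dict(zip(actions, softmax([logits[a] for a in actions])))
# probs now matches freqs, so policy entropy equals demo-set entropy.
```

Whatever spread the demonstrations have, maximum likelihood reproduces it; this is why the expansion side of the tradeoff depends entirely on how diverse the demonstration set is.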

This finding has practical implications for DR agent design: RL-trained search agents need explicit diversity mechanisms (entropy regularization, diverse reward models, periodic SFT refreshes) or they will converge on query templates that work well on average but fail under distribution shift. The remedy from Do critique models improve diversity during training itself? applies here too: external critique prevents the RL agent from collapsing to a narrow search strategy.
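Of those mechanisms, entropy regularization is the most direct to sketch: it adds a bonus term whose gradient pulls the dominant logit down and the rare logits up, counteracting the collapse. A minimal sketch; the closed-form gradient follows from H = -Σ p_i log p_i and the softmax Jacobian, and the β value is a hypothetical hyperparameter:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_grad(probs):
    """Gradient of Shannon entropy H = -sum_i p_i log p_i w.r.t. softmax
    logits: dH/dlogit_i = -p_i * (log p_i - sum_j p_j log p_j)."""
    avg_log = sum(p * math.log(p) for p in probs if p > 0)
    return [-p * (math.log(p) - avg_log) if p > 0 else 0.0 for p in probs]

# On a collapsed policy the bonus pushes toward uniform; on a uniform
# policy it vanishes, so it only acts when diversity is actually lost.
beta = 0.01  # entropy coefficient (hypothetical value)
peaked = softmax([4.0, 0.0, 0.0])
bonus = [beta * g for g in entropy_grad(peaked)]
```

In a training loop this bonus gradient is simply added to the policy-gradient update, scaled by β, which keeps entropy bounded away from zero; diverse reward models and periodic SFT refreshes attack the same collapse from the reward side and the data side instead.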


