Reinforcement Learning for LLMs

Why does RLVR training narrow a model's problem solving ability?

RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.

Note · 2026-02-22 · sourced from RLVR
How do domain training techniques actually reshape model behavior? How should researchers navigate LLM reasoning research? What does reward learning actually do to model reasoning?

RLVR faces a fundamental challenge: the solution space of LLMs is so vast and sparse that current techniques cannot guide effective exploration of unknown pathways. Long reasoning tasks are especially vulnerable — a single erroneous step nullifies the reward for the entire trajectory, failing to provide any positive signal for acquiring new knowledge. The result is "capability boundary collapse": the model's exploratory range contracts, and its problem-solving scope narrows rather than expands.
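
To make the sparsity concrete, here is a minimal sketch (hypothetical names, not any paper's actual verifier) of an all-or-nothing trajectory-level reward, plus a back-of-envelope calculation of how rarely long chains earn it under an independence assumption:

```python
def verifiable_reward(final_answer: str, reference: str) -> float:
    """All-or-nothing trajectory reward: no partial credit, so a single
    erroneous step that corrupts the final answer zeroes the signal."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

# Illustrative sparsity: if each of T reasoning steps is independently
# correct with probability p, the whole trajectory earns reward with
# probability p**T -- vanishing fast for long chains.
p, T = 0.95, 50
print(f"P(reward = 1) for a {T}-step chain: {p**T:.4f}")  # ~0.0769
```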

The mechanism parallels an educational insight: a model that only "thinks" (exploits internal knowledge) without "learning" (exploring external knowledge) will be "in peril." RLVR excels at inward exploitation, refining and optimizing already-known reasoning methods, but is weak at outward exploration, discovering reasoning paths to which the current policy assigns low probability.

RL-PLUS addresses this with two components. Multiple Importance Sampling combines information from multiple policies to provide low-variance, unbiased reward estimation from external data — avoiding both the systematic bias of on-policy approaches and the high variance of naive off-policy corrections. An Exploration-Based Advantage Function reshapes the learning objective by up-weighting advantages for reasoning paths that are correct but have low probability under the current policy — explicitly incentivizing discovery of valuable information the model would typically overlook.
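
The sketch below is an illustrative rendering of these two ideas, not RL-PLUS's published formulation: the balance-heuristic mixture weight and the length-normalized novelty bonus are assumptions chosen to make the mechanics concrete.

```python
import torch

def mis_weights(logp_target: torch.Tensor,      # [N] log pi_theta(y_i)
                logp_behaviors: torch.Tensor,   # [K, N] log mu_k(y_i)
                mixture_coefs: torch.Tensor) -> torch.Tensor:  # [K], sums to 1
    """Mixture (balance-heuristic) importance weight
    pi_theta(y) / sum_k c_k * mu_k(y), computed in log space for stability.
    Pooling the K behavior policies in the denominator is what keeps the
    variance low relative to a single-policy off-policy ratio."""
    log_mixture = torch.logsumexp(
        logp_behaviors + mixture_coefs.log().unsqueeze(1), dim=0)  # [N]
    return torch.exp(logp_target - log_mixture)

def exploration_advantage(rewards: torch.Tensor,     # [N] verifiable rewards
                          baseline: torch.Tensor,    # [N] or scalar baseline
                          logp_target: torch.Tensor, # [N] log pi_theta(y_i)
                          lengths: torch.Tensor,     # [N] token counts
                          alpha: float = 1.0) -> torch.Tensor:
    """Reshape advantages so correct-but-unlikely paths get extra weight.
    Length-normalized sequence probability stands in for 'how likely the
    current policy is to produce this path' (an illustrative choice)."""
    adv = rewards - baseline
    p_norm = torch.exp(logp_target / lengths)       # in (0, 1]
    bonus = (1.0 - p_norm).clamp(min=0.0) ** alpha  # high when path is unlikely
    return torch.where(rewards > 0, adv * (1.0 + bonus), adv)
```

A policy-gradient update would then scale each trajectory's log-likelihood gradient by `mis_weights(...) * exploration_advantage(...)`, so external trajectories contribute unbiased signal and rare correct paths get the strongest push.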

Building on "Does policy entropy collapse limit reasoning performance in RL?": capability boundary collapse is the downstream consequence of entropy collapse, surfacing at the task-capability level. Entropy collapse constrains the token-level distribution; capability boundary collapse constrains the problem-level distribution. Both reflect the same fundamental dynamic: optimization pressure narrows the space faster than exploration can maintain it.

Building on "Why do specialized models fail outside their domain?": capability boundary collapse is the RL-specific mechanism behind domain capability cliffs. The model doesn't just specialize; it actively loses the ability to generalize.

The Invisible Leash: formal constraint. "The Invisible Leash" provides the theoretical grounding: RLVR is constrained by the base model's support — unable to sample solutions with zero initial probability — and operates as a conservative reweighting mechanism that restricts discovery of entirely original solutions. The entropy-reward tradeoff is formalized: while RLVR reliably enhances pass@1 precision, the shrinkage of empirical support generally outweighs the expansion under larger sampling budgets. A subtle finding: RLVR sometimes increases token-level entropy (greater uncertainty at each generation step) while decreasing answer-level entropy (convergence onto fewer distinct answers). These seemingly more uncertain paths ultimately converge onto a smaller set of solutions. Breaking this invisible leash requires explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
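
The token-level versus answer-level divergence is directly measurable from n sampled generations. A small sketch with hypothetical helper names, assuming per-step (top-k) token probabilities are available:

```python
import math
from collections import Counter

def token_level_entropy(step_distributions: list[list[float]]) -> float:
    """Mean per-step Shannon entropy (nats). step_distributions[t] is the
    probability the model assigned to each (top-k) token at step t."""
    total = 0.0
    for probs in step_distributions:
        total += -sum(p * math.log(p) for p in probs if p > 0)
    return total / len(step_distributions)

def answer_level_entropy(final_answers: list[str]) -> float:
    """Shannon entropy (nats) of the empirical distribution over distinct
    final answers across n samples; falls as answers converge."""
    n = len(final_answers)
    counts = Counter(final_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# The leash signature: token_level_entropy can rise after RLVR while
# answer_level_entropy falls, e.g. 8 samples collapsing onto 3 answers.
print(answer_level_entropy(["42"] * 6 + ["41", "43"]))  # ~0.74 nats
```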


Source: RLVR; enriched from Flaws
