Does sparsity in RL arise from training on policy-distribution data?

This explores whether the small slice of parameters RL actually changes (its 'sparsity') is a side-effect of RL learning from data the model itself generates — i.e. on-policy, in-distribution data — rather than from any explicit sparsity penalty.

This reads the question as: when RL touches only a sliver of a model's weights, is that because RL trains on the model's own output distribution rather than novel data? The corpus has a direct answer and a lateral one. Directly, RL really does update very few parameters: only 5–30% across seven algorithms and ten model families, with no regularization forcing it (see: Does reinforcement learning update only a small fraction of parameters?). The striking detail is that these updates are nearly full-rank and nearly identical across random seeds. That rules out 'sparsity is arbitrary noise': the model keeps returning to the same small subnetwork, which is consistent with what you'd expect if the data it learns from is constrained to its own behavior rather than exploring fresh territory.
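To make 'updates only 5–30% of parameters' and 'nearly identical across random seeds' concrete, here is a minimal sketch of how the two quantities could be measured on flattened weight vectors. The function names, the tolerance `tol`, and the toy weights are all assumptions for illustration, not the procedure used in the cited note, and the sketch does not cover the full-rank property.

```python
def updated_mask(before, after, tol=1e-8):
    """Mark a parameter as updated if it moved by more than tol (tol is an assumed threshold)."""
    return [abs(a - b) > tol for b, a in zip(before, after)]


def update_fraction(mask):
    """Fraction of parameters that changed during fine-tuning."""
    return sum(mask) / len(mask)


def mask_overlap(mask_a, mask_b):
    """Jaccard overlap between the updated-parameter sets of two runs."""
    both = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
    either = sum(1 for x, y in zip(mask_a, mask_b) if x or y)
    return both / either if either else 1.0


# Toy flattened weights (made-up numbers): two RL runs from the same base,
# differing only in random seed.
base   = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
seed_a = [0.10, -0.20, 0.35, 0.40, -0.50, 0.55, 0.70, -0.80, 0.90, 1.00]
seed_b = [0.10, -0.20, 0.33, 0.40, -0.50, 0.58, 0.70, -0.80, 0.90, 1.00]

mask_a = updated_mask(base, seed_a)
mask_b = updated_mask(base, seed_b)

print(update_fraction(mask_a))       # 0.2: two of ten weights moved
print(mask_overlap(mask_a, mask_b))  # 1.0: both seeds touched the same weights
```

In this toy case the two seeds change different amounts but the same positions, which is the pattern the seed-stability claim describes: the location of the update is stable even when its values are not.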

The link to policy-distribution data gets sharper when you look at the on-policy failure modes documented elsewhere. RLVR's on-policy constraint causes 'capability boundary collapse': the model exploits what it already does well and avoids exploring, narrowing its problem-solving scope (see: Why does RLVR training narrow a model's problem solving ability?). Relatedly, RL post-training amplifies a single dominant format already present in pretraining within the first epoch while suppressing the alternatives (see: Does RL training collapse format diversity in pretrained models?). Both describe the same thing from the behavior side: training on in-distribution rollouts concentrates change rather than spreading it. So the parameter sparsity in [1] and the distributional collapse in [2] and [3] look like two views of one mechanism: RL refines what's already in-policy instead of rewriting the model.

There's a deeper clue in how networks store the familiar. Representational density is *learned* through data familiarity: networks build dense activations for inputs they've seen and default to sparse representations for unfamiliar ones (see: Is representational sparsity learned or intrinsic to neural networks?). That finding is about activations rather than weight updates, so the link to [1] is by analogy: it suggests that how a network responds depends on how familiar the data is, not only on the training algorithm. Since on-policy RL data is by definition familiar (the model generated it), the network already represents it well, and small, concentrated updates are close to the predicted outcome rather than a surprise.

Where the corpus complicates a clean 'yes': several notes show the same on-policy diet also narrows *exploration*, not just the set of updated parameters. RL squeezes behavioral diversity in search agents through entropy collapse, while SFT on diverse demonstrations preserves breadth (see: Does reinforcement learning squeeze exploration diversity in search agents?), and policy entropy collapse is identified as the primary ceiling on RL reasoning gains (see: Does policy entropy collapse limit reasoning performance in RL?). The lateral payoff: 'sparsity' and 'entropy collapse' may be the parameter-space and behavior-space signatures of the same root cause, namely learning from your own narrow distribution. That reframes the question. The interesting follow-up isn't whether on-policy data causes sparsity, but whether the structured, seed-stable sparsity in [1] is the *good* face of in-distribution training (efficient, consolidatable) while entropy collapse is the *bad* face (lost exploration), and whether you can keep one without the other by injecting off-policy diversity, as the SFT and exploration-reward interventions in [5] and [2] attempt.
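For readers who want the behavior-space quantity pinned down, 'policy entropy' here can be read as the Shannon entropy of the policy's distribution over choices. The sketch below is illustrative only: the three distributions are made-up stand-ins for a policy early in training, late in training, and fully collapsed, not measurements from any of the notes.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution over strategies."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical distributions over four candidate strategies.
broad     = [0.25, 0.25, 0.25, 0.25]  # all strategies still explored
narrow    = [0.97, 0.01, 0.01, 0.01]  # one strategy dominates
collapsed = [1.00, 0.00, 0.00, 0.00]  # no exploration left

for name, dist in [("broad", broad), ("narrow", narrow), ("collapsed", collapsed)]:
    print(name, round(policy_entropy(dist), 3))
```

Entropy is highest for the uniform policy (ln 4 for four options) and falls to zero once a single strategy takes all the probability mass, which is the sense in which the notes speak of entropy 'collapse'.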

Sources (6 notes)

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
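The entropy law quoted in the last note can be worked through numerically. The sketch below evaluates R = -a·exp(H) + b for a falling entropy H, reading R as performance, H as policy entropy, and a and b as fitted coefficients. The values chosen for `a` and `b` are hypothetical, picked only to show the shape of the curve.

```python
import math


def predicted_performance(entropy, a, b):
    """Evaluate the fitted form R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


a, b = 0.2, 0.9  # hypothetical coefficients, not fitted to any real run

# Performance rises as entropy is spent, with shrinking gains per step.
for h in (1.5, 1.0, 0.5, 0.0):
    print(h, round(predicted_performance(h, a, b), 3))

# At H = 0 the curve reaches its ceiling, which equals b - a.
ceiling = predicted_performance(0.0, a, b)
gain_early = predicted_performance(0.5, a, b) - predicted_performance(1.0, a, b)
gain_late = ceiling - predicted_performance(0.5, a, b)
print(gain_early > gain_late)  # True: the same entropy drop buys less near zero
```

Under this form, once entropy is exhausted there is nothing left to trade for performance, which is why the note frames entropy-preserving interventions as a way to delay the ceiling rather than remove it.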

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL-in-LLMs researcher re-testing whether on-policy data distribution *causes* sparse parameter updates in fine-tuning, or whether the link is correlation masking separate mechanisms.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and rest on four pillars:
• RL updates only 5–30% of parameters across seven algorithms and ten model families, in nearly full-rank, seed-stable subnetworks — not noise, but structured (2025-05, arXiv:2505.11711).
• On-policy RL amplifies a single dominant pretraining-distribution format within the first epoch while suppressing alternatives; capability-boundary collapse narrows problem-solving scope (2025-04, arXiv:2504.07912; 2025-07, arXiv:2508.00222).
• Networks learn dense activations for familiar data and sparse representations for OOD inputs; entropy collapse is the primary reasoning-scaling bottleneck (2025-05, arXiv:2505.22617; 2026-03, arXiv:2603.03415).
• SFT on diverse demos preserves exploration breadth; RL training for search squeezes diversity through entropy collapse (2026-05, arXiv:2605.22817).

Anchor papers (verify; mind their dates):
• arXiv:2505.11711 (2025-05): Sparse subnetwork updates, full-rank, seed-stable.
• arXiv:2504.07912 (2025-04): Echo-chamber amplification of pretraining format.
• arXiv:2505.22617 (2025-05): Entropy collapse as reasoning bottleneck.
• arXiv:2605.22817 (2026-05): Diversity-trained policies improve test-time search.

Your task:
(1) RE-TEST THE CAUSAL CHAIN: Does on-policy *data distribution* directly cause parameter sparsity, or do entropy collapse and format-amplification arise from the RL *objective* (reward-driven convergence) independently? Check whether newer work (last 6 mo.) decouples these — e.g., off-policy replay, diversity-weighted losses, or multi-task RL relaxing sparsity without changing data source. Distinguish the durable claim ('RL narrows scope') from the perishable one ('sparsity is inevitable from on-policy data').
(2) Surface the strongest work contradicting the library's implicit hypothesis that on-policy RL is the root cause. Look for papers showing sparse, stable updates arising *without* on-policy data, or dense updates *with* on-policy data under novel training regimes (e.g., process reward, best-of-N).
(3) Propose two research questions that assume the regime may have shifted: (a) Can you decouple the 'good' face of sparsity (efficient, stable consolidation) from the 'bad' face (entropy/exploration loss) via orthogonal interventions? (b) Do newer model scales, instruction-tuned priors, or multi-agent orchestration change whether sparse subnetworks are sufficient for RL gains?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
