Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.

Note · 2026-02-22 · sourced from Reinforcement Learning

The surprising finding is not that RL changes models — it's how little it changes them. Across PPO, GRPO, DPO, and four other algorithms applied to ten different LLM families, RL consistently updates only 5-30% of parameters; the rest remain effectively unchanged. This sparsity is intrinsic: no explicit sparsity-promoting regularization or architectural constraint is applied.

The critical nuance is that these sparse updates are nearly full-rank. This is not low-rank adaptation (as in LoRA). The updated parameters span almost the full subspace that the parameter matrices can represent. So RL selects a small subset of parameters, but that subset is geometrically rich enough to represent complex transformations. The distinction matters: low-rank would mean RL operates in a constrained subspace; sparse-but-full-rank means RL identifies which parameters matter while preserving full expressivity.
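A minimal sketch of how to check this pattern on a single weight matrix, assuming you have the pre-RL and post-RL checkpoints loaded as tensors (the function name and tolerance are illustrative, not from the paper):

```python
import torch

def update_stats(w_base: torch.Tensor, w_rl: torch.Tensor, tol: float = 1e-6):
    """Sparsity and effective rank of the RL update to one weight matrix."""
    delta = w_rl - w_base
    # Fraction of entries that actually moved (beyond numerical tolerance).
    changed_frac = (delta.abs() > tol).float().mean().item()
    # Effective rank of the update, relative to the maximum possible rank.
    rank = torch.linalg.matrix_rank(delta.double(), atol=tol).item()
    rank_ratio = rank / min(delta.shape)
    return changed_frac, rank_ratio

# Pattern reported in the paper: changed_frac ~ 0.05-0.30 while rank_ratio ~ 1.0,
# i.e. few entries move, but the update is not confined to a low-rank subspace.
```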

Three additional properties make this pattern robust. First, subnetworks identified from different random seeds show substantially greater overlap than chance, suggesting the subnetwork is a structural property of the pretrained model, not an artifact of training. Second, finetuning the subnetwork alone recovers both the test accuracy and the actual parameter values of full finetuning. Third, the sparsity is distributed — nearly all parameter matrices receive similarly sparse updates rather than concentrating in a subset of layers.
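The seed-consistency claim can be framed as a mask-overlap test: compare the changed-parameter masks from two independent RL runs against the overlap expected by chance. A hypothetical sketch:

```python
import torch

def changed_mask(w_base: torch.Tensor, w_rl: torch.Tensor, tol: float = 1e-6):
    """Boolean mask of the parameters an RL run actually moved."""
    return (w_rl - w_base).abs() > tol

def mask_iou(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """Intersection-over-union of two subnetwork masks from different seeds."""
    inter = (mask_a & mask_b).sum().item()
    union = (mask_a | mask_b).sum().item()
    return inter / max(union, 1)

# For independent random masks with densities p and q, chance-level IoU is
# roughly p*q / (p + q - p*q); seed-to-seed overlap well above this baseline
# supports the claim that the subnetwork is a property of the pretrained model.
```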

The authors conjecture this sparsity arises primarily from training on data near the policy distribution. Read alongside Does RL improve domain reasoning by adding knowledge or removing it?, the sparse-but-full-rank pattern provides a mechanistic explanation: RL doesn't need to transform the entire model because most of the model is already adequate. It just needs to adjust a targeted subset — the parameters that control which reasoning paths are taken.

This has implications for efficient RL training. If the effective parameter footprint is 5-30%, techniques that exploit this sparsity (targeted updates, efficient memory use) could dramatically reduce RL training cost without sacrificing quality.
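A sketch of what exploiting that footprint could look like: restrict optimizer updates to a precomputed subnetwork mask. The mask dictionary and hook point below are assumptions for illustration, not an API from the paper; the masks could come from a pilot run or from the seed-stable subnetwork described above.

```python
import torch

def masked_sgd_step(named_params, masks, lr=1e-5):
    """One plain-SGD step restricted to the identified subnetwork.

    `masks` maps parameter names to boolean tensors marking the 5-30% of
    entries the RL update is expected to live in. Gradients outside the
    mask are zeroed, so ~70-95% of weights never change; a sparse optimizer
    could likewise skip storing their momentum/Adam state.
    """
    with torch.no_grad():
        for name, p in named_params:
            if p.grad is None:
                continue
            p.grad.mul_(masks[name].to(p.grad.dtype))  # keep only in-subnetwork grads
            p.add_(p.grad, alpha=-lr)                  # update only masked entries
```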

Token-level 80/20 parallel: The parameter-level sparsity has a striking token-level analog. The "Beyond 80/20" analysis of RLVR shows that high-entropy minority tokens — the ~20% of tokens where the model is most uncertain — are the critical forking points that carry most of the learning signal. Restricting gradient updates to only these 20% of tokens matches or exceeds full-token updates (+11.04 on AIME'25 for Qwen3-32B). The remaining 80% of tokens are low-entropy, already-decided outputs where gradient updates add noise rather than signal. This creates a dual sparsity picture: RL updates 5-30% of parameters, and the effective signal comes from ~20% of tokens. Both forms of sparsity are intrinsic — not imposed by regularization — and both suggest RL is fundamentally a targeted refinement process rather than a wholesale model transformation. See Do only 20 percent of tokens actually matter for reasoning?.
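A minimal sketch of the entropy-based token selection, assuming a standard per-token policy-gradient loss that gets multiplied by the mask before reduction (function name and `keep_fraction` are illustrative):

```python
import torch
import torch.nn.functional as F

def high_entropy_token_mask(logits: torch.Tensor, keep_fraction: float = 0.2):
    """Select the top-`keep_fraction` highest-entropy token positions.

    logits: (batch, seq_len, vocab). Returns a (batch, seq_len) float mask
    that is 1 at high-uncertainty "forking" tokens and 0 elsewhere, so that
    only ~20% of positions receive gradient.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (batch, seq_len)
    k = max(1, int(keep_fraction * entropy.shape[-1]))
    cutoff = entropy.topk(k, dim=-1).values[..., -1:]      # per-sequence threshold
    return (entropy >= cutoff).float()
```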

The same sparse-update structure appears in SFT. Core Parameter Isolation Fine-Tuning (CPI-FT) identifies task-specific "core parameter regions" — the parameters with largest update magnitudes during individual task fine-tuning — and shows that these regions are concentrated and task-specific. CPI-FT exploits this by transplanting core parameters from individually fine-tuned models and SLERP-merging non-core parameters, consistently outperforming full multi-task SFT. The key finding: full multi-task SFT (uniform parameter updates across all tasks) is consistently the worst performer — temporal task scheduling alone is insufficient without explicit structural parameter isolation. This extends the RL sparsity finding to supervised fine-tuning: task-relevant changes naturally concentrate in specific parameter regions regardless of whether the training signal is reward-based or loss-based. See Can isolating task-specific parameters prevent multi-task fine-tuning interference?.
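A schematic of that merge for one weight matrix, assuming two tasks (the `top_fraction` threshold and function names are illustrative simplifications; the actual CPI-FT pipeline handles region overlap and task scheduling in more detail):

```python
import torch

def core_mask(w_base, w_task, top_fraction=0.05):
    """Core region: the top-`top_fraction` largest-magnitude updates for a task."""
    delta = (w_task - w_base).abs()
    k = max(1, int(top_fraction * delta.numel()))
    cutoff = delta.flatten().topk(k).values[-1]
    return delta >= cutoff

def slerp(w_a, w_b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two weight tensors."""
    a, b = w_a.flatten(), w_b.flatten()
    cos = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1.0, 1.0)
    omega = torch.arccos(cos)
    so = torch.sin(omega)
    if so.abs() < eps:                      # nearly parallel: fall back to lerp
        return ((1 - t) * a + t * b).view_as(w_a)
    out = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return out.view_as(w_a)

def cpi_merge(w_base, w_task_a, w_task_b):
    """Merge one weight matrix from two single-task models, CPI-FT style:
    transplant each task's core parameters, SLERP-average the rest."""
    merged = slerp(w_task_a, w_task_b)                        # non-core parameters
    mask_a, mask_b = core_mask(w_base, w_task_a), core_mask(w_base, w_task_b)
    merged = torch.where(mask_a, w_task_a, merged)            # task A's core
    merged = torch.where(mask_b & ~mask_a, w_task_b, merged)  # task B's core
    return merged
```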


Source: Reinforcement Learning; enriched from Training Fine Tuning, RLVR

Original note title: rl updates only 5-30 percent of parameters in sparse but full-rank subnetworks