Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.

Note · 2026-02-22 · sourced from Reinforcement Learning

The surprising finding is not that RL changes models — it's how little it changes them. Across PPO, GRPO, DPO, and four other algorithms applied to ten different LLM families, RL consistently updates only 5-30% of parameters; the rest remain effectively unchanged. This sparsity is intrinsic: no explicit sparsity-promoting regularization or architectural constraint is applied.

The critical nuance is that these sparse updates are nearly full-rank. This is not low-rank adaptation (as in LoRA). The updated parameters span almost the full subspace that the parameter matrices can represent. So RL selects a small subset of parameters, but that subset is geometrically rich enough to represent complex transformations. The distinction matters: low-rank would mean RL operates in a constrained subspace; sparse-but-full-rank means RL identifies which parameters matter while preserving full expressivity.
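A minimal sketch of how to check this pattern on a single weight matrix, assuming you have the pre-RL and post-RL checkpoints loaded as tensors (the function name and tolerance are illustrative, not from the paper):

```python
import torch

def update_stats(w_base: torch.Tensor, w_rl: torch.Tensor, tol: float = 1e-6):
    """Sparsity and effective rank of the RL update to one weight matrix."""
    delta = w_rl - w_base
    # Fraction of entries that actually moved (beyond numerical tolerance).
    changed_frac = (delta.abs() > tol).float().mean().item()
    # Effective rank of the update, relative to the maximum possible rank.
    rank = torch.linalg.matrix_rank(delta.double(), atol=tol).item()
    rank_ratio = rank / min(delta.shape)
    return changed_frac, rank_ratio

# Pattern reported in the paper: changed_frac ~ 0.05-0.30 while rank_ratio ~ 1.0,
# i.e. few entries move, but the update is not confined to a low-rank subspace.
```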

Three additional properties make this pattern robust. First, subnetworks identified from different random seeds show substantially greater overlap than chance, suggesting the subnetwork is a structural property of the pretrained model, not an artifact of training. Second, finetuning the subnetwork alone recovers both the test accuracy and the actual parameter values of full finetuning. Third, the sparsity is distributed — nearly all parameter matrices receive similarly sparse updates rather than concentrating in a subset of layers.
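The seed-consistency claim can be framed as a mask-overlap test: compare the changed-parameter masks from two independent RL runs against the overlap expected by chance. A hypothetical sketch:

```python
import torch

def changed_mask(w_base: torch.Tensor, w_rl: torch.Tensor, tol: float = 1e-6):
    """Boolean mask of the parameters an RL run actually moved."""
    return (w_rl - w_base).abs() > tol

def mask_iou(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """Intersection-over-union of two subnetwork masks from different seeds."""
    inter = (mask_a & mask_b).sum().item()
    union = (mask_a | mask_b).sum().item()
    return inter / max(union, 1)

# For independent random masks with densities p and q, chance-level IoU is
# roughly p*q / (p + q - p*q); seed-to-seed overlap well above this baseline
# supports the claim that the subnetwork is a property of the pretrained model.
```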

The authors conjecture this sparsity arises primarily from training on data near the policy distribution. Read alongside Does RL improve domain reasoning by adding knowledge or removing it?, the sparse-but-full-rank pattern provides a mechanistic explanation: RL doesn't need to transform the entire model because most of the model is already adequate. It just needs to adjust a targeted subset — the parameters that control which reasoning paths are taken.

This has implications for efficient RL training. If the effective parameter footprint is 5-30%, techniques that exploit this sparsity (targeted updates, efficient memory use) could dramatically reduce RL training cost without sacrificing quality.
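A sketch of what exploiting that footprint could look like: restrict optimizer updates to a precomputed subnetwork mask. The mask dictionary and hook point below are assumptions for illustration, not an API from the paper; the masks could come from a pilot run or from the seed-stable subnetwork described above.

```python
import torch

def masked_sgd_step(named_params, masks, lr=1e-5):
    """One plain-SGD step restricted to the identified subnetwork.

    `masks` maps parameter names to boolean tensors marking the 5-30% of
    entries the RL update is expected to live in. Gradients outside the
    mask are zeroed, so ~70-95% of weights never change; a sparse optimizer
    could likewise skip storing their momentum/Adam state.
    """
    with torch.no_grad():
        for name, p in named_params:
            if p.grad is None:
                continue
            p.grad.mul_(masks[name].to(p.grad.dtype))  # keep only in-subnetwork grads
            p.add_(p.grad, alpha=-lr)                  # update only masked entries
```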

Token-level 80/20 parallel: The parameter-level sparsity has a striking token-level analog. The "Beyond 80/20" analysis of RLVR shows that high-entropy minority tokens — the ~20% of tokens where the model is most uncertain — are the critical forking points that carry most of the learning signal. Restricting gradient updates to only these 20% of tokens matches or exceeds full-token updates (+11.04 on AIME'25 for Qwen3-32B). The remaining 80% of tokens are low-entropy, already-decided outputs where gradient updates add noise rather than signal. This creates a dual sparsity picture: RL updates 5-30% of parameters, and the effective signal comes from ~20% of tokens. Both forms of sparsity are intrinsic — not imposed by regularization — and both suggest RL is fundamentally a targeted refinement process rather than a wholesale model transformation. See Do only 20 percent of tokens actually matter for reasoning?.
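A minimal sketch of the entropy-based token selection, assuming a standard per-token policy-gradient loss that gets multiplied by the mask before reduction (function name and `keep_fraction` are illustrative):

```python
import torch
import torch.nn.functional as F

def high_entropy_token_mask(logits: torch.Tensor, keep_fraction: float = 0.2):
    """Select the top-`keep_fraction` highest-entropy token positions.

    logits: (batch, seq_len, vocab). Returns a (batch, seq_len) float mask
    that is 1 at high-uncertainty "forking" tokens and 0 elsewhere, so that
    only ~20% of positions receive gradient.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (batch, seq_len)
    k = max(1, int(keep_fraction * entropy.shape[-1]))
    cutoff = entropy.topk(k, dim=-1).values[..., -1:]      # per-sequence threshold
    return (entropy >= cutoff).float()
```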

The same sparse-update structure appears in SFT. Core Parameter Isolation Fine-Tuning (CPI-FT) identifies task-specific "core parameter regions" — the parameters with largest update magnitudes during individual task fine-tuning — and shows that these regions are concentrated and task-specific. CPI-FT exploits this by transplanting core parameters from individually fine-tuned models and SLERP-merging non-core parameters, consistently outperforming full multi-task SFT. The key finding: full multi-task SFT (uniform parameter updates across all tasks) is consistently the worst performer — temporal task scheduling alone is insufficient without explicit structural parameter isolation. This extends the RL sparsity finding to supervised fine-tuning: task-relevant changes naturally concentrate in specific parameter regions regardless of whether the training signal is reward-based or loss-based. See Can isolating task-specific parameters prevent multi-task fine-tuning interference?.
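A schematic of that merge for one weight matrix, assuming two tasks (the `top_fraction` threshold and function names are illustrative simplifications; the actual CPI-FT pipeline handles region overlap and task scheduling in more detail):

```python
import torch

def core_mask(w_base, w_task, top_fraction=0.05):
    """Core region: the top-`top_fraction` largest-magnitude updates for a task."""
    delta = (w_task - w_base).abs()
    k = max(1, int(top_fraction * delta.numel()))
    cutoff = delta.flatten().topk(k).values[-1]
    return delta >= cutoff

def slerp(w_a, w_b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two weight tensors."""
    a, b = w_a.flatten(), w_b.flatten()
    cos = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1.0, 1.0)
    omega = torch.arccos(cos)
    so = torch.sin(omega)
    if so.abs() < eps:                      # nearly parallel: fall back to lerp
        return ((1 - t) * a + t * b).view_as(w_a)
    out = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return out.view_as(w_a)

def cpi_merge(w_base, w_task_a, w_task_b):
    """Merge one weight matrix from two single-task models, CPI-FT style:
    transplant each task's core parameters, SLERP-average the rest."""
    merged = slerp(w_task_a, w_task_b)                        # non-core parameters
    mask_a, mask_b = core_mask(w_base, w_task_a), core_mask(w_base, w_task_b)
    merged = torch.where(mask_a, w_task_a, merged)            # task A's core
    merged = torch.where(mask_b & ~mask_a, w_task_b, merged)  # task B's core
    return merged
```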


Source: Reinforcement Learning; enriched from Training Fine Tuning, RLVR

Original note title: rl updates only 5-30 percent of parameters in sparse but full-rank subnetworks