What happens to model capability as weight sparsity increases during training?

This note explores what happens to a model's capability when its weights are made sparse during training. The corpus splits this into two distinct phenomena: sparsity you impose deliberately, and sparsity that emerges on its own.

The corpus suggests the answer depends on *which kind* of sparsity you mean, and that the trade is not a simple loss of capability. The most direct case is deliberately training with sparse weights: when most connections are forced to zero, transformers reorganize into compact, human-readable circuits in which individual neurons map to simple concepts, and ablation confirms those circuits are both necessary and sufficient for the task (see "Can sparse weight training make neural networks interpretable by design?"). So capability isn't destroyed; it gets *concentrated* and made legible. The catch is scale: this interpretability-by-construction has only been shown up to tens of millions of parameters, and maintaining it at larger sizes is unsolved. The honest framing is a trade of raw capacity for transparency, not a free lunch.
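As a rough illustration of what "forcing most connections to zero" can look like, the sketch below applies magnitude-based top-k masking to a flat list of weights. This is an assumption made for illustration: the source note does not say which sparsification method was used, and `topk_magnitude_mask`, `sparsity_level`, and the toy weight values are all hypothetical.

```python
from typing import List


def topk_magnitude_mask(weights: List[float], keep: int) -> List[float]:
    """Zero every weight except the `keep` entries with the largest magnitude.

    Assumption: magnitude-based top-k is one common way to impose weight
    sparsity; it is not confirmed as the method used in the source note.
    """
    if keep <= 0:
        return [0.0] * len(weights)
    # Indices ordered from largest to smallest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]


def sparsity_level(weights: List[float]) -> float:
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Toy values, invented for the example.
weights = [0.9, -0.05, 0.02, -1.2, 0.3, 0.01, -0.4, 0.07]
sparse_weights = topk_magnitude_mask(weights, keep=2)
level = sparsity_level(sparse_weights)  # 6 of 8 entries are now zero
```

In a real training loop a mask like this would typically be reapplied after each optimizer step so that pruned weights stay at zero; that detail is also an assumption rather than something the note states.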

The more surprising thread is that sparsity often shows up without anyone asking for it. Across seven algorithms and ten model families, reinforcement learning spontaneously confines its updates to just 5–30% of parameters, with no regularization required. Those sparse updates are nearly full-rank and nearly identical across random seeds, meaning the model is selecting a structured subnetwork rather than pruning at random (see "Does reinforcement learning update only a small fraction of parameters?"). The companion note on what changes inside the model adds the mechanism: this sparse footprint appears to come mostly from *suppressing* wrong trajectories rather than amplifying right ones (see "What actually changes inside a model during RL training?"). Note that the sparsity here is in which weights get *updated*, not in how many weights are zero. So in this case sparse weight updates coincide with capability *gains*: the sparsity is the signature of efficient, targeted learning, not damage.
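"Sparse but nearly full-rank" can sound contradictory, so here is a minimal sketch of how the two quantities differ. It compares a base and a tuned weight matrix, counts the fraction of entries that changed, and computes the rank of the difference. The function names, the tolerance, and the 4x4 toy matrices are hypothetical; the sources do not describe their measurement code.

```python
from typing import List

Matrix = List[List[float]]


def update_fraction(base: Matrix, tuned: Matrix, tol: float = 0.0) -> float:
    """Fraction of entries that differ by more than `tol` between two matrices."""
    total = 0
    changed = 0
    for row_b, row_t in zip(base, tuned):
        for b, t in zip(row_b, row_t):
            total += 1
            if abs(t - b) > tol:
                changed += 1
    return changed / total


def matrix_rank(m: Matrix, eps: float = 1e-9) -> int:
    """Rank via Gaussian elimination with partial pivoting (small matrices only)."""
    rows = [row[:] for row in m]  # work on a copy
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) < eps:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


# Toy layer (invented values): only the diagonal entries are updated.
base = [[0.0] * 4 for _ in range(4)]
tuned = [[0.5 if i == j else 0.0 for j in range(4)] for i in range(4)]
delta = [[t - b for b, t in zip(rb, rt)] for rb, rt in zip(base, tuned)]

frac = update_fraction(base, tuned)  # 4 of 16 entries changed
rank = matrix_rank(delta)            # the update still spans all 4 dimensions
```

The toy update touches only a quarter of the entries yet has full rank, which is the sense in which a sparse update differs from a low-rank one.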

There's a third register the corpus brings in laterally: sparsity in activations and representations, which behaves differently again. Models sparsify their hidden states when they hit unfamiliar, out-of-distribution tasks, and rather than being a failure mode, this acts as a selective filter that *stabilizes* performance under shift (see "Do language models sparsify their activations under difficult tasks?"). The reason is that density itself is learned: networks build dense representations for data they've seen a lot and fall back to sparse ones for the unfamiliar (see "Is representational sparsity learned or intrinsic to neural networks?"). Sparsity here tracks how *unfamiliar* the input is, not how capable the model is.
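Activation sparsity is a property of hidden states rather than of weights, and one simple way to quantify it is the fraction of entries whose magnitude falls below a small threshold. The sketch below assumes that definition; the sources may use a different metric, and the two vectors are invented toy values, not measured activations.

```python
from typing import Sequence


def activation_sparsity(hidden: Sequence[float], threshold: float = 1e-3) -> float:
    """Fraction of hidden-state entries with magnitude below `threshold`.

    Assumption: a near-zero count is used as the sparsity measure here;
    the source notes do not specify their exact metric.
    """
    return sum(1 for h in hidden if abs(h) < threshold) / len(hidden)


# Invented examples: a dense state (as for familiar input) and a
# sparse state (as for unfamiliar input).
familiar_state = [0.8, -0.3, 0.5, 0.2, -0.6, 0.1, 0.4, -0.9]
unfamiliar_state = [0.0, 0.0, 1.4, 0.0, 0.0, 0.0, -2.1, 0.0]

dense_score = activation_sparsity(familiar_state)     # no near-zero entries
sparse_score = activation_sparsity(unfamiliar_state)  # 6 of 8 entries near zero
```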

Put together, the corpus reframes the question. Sparsity isn't a dial that trades away capability; it's a fingerprint of *how* a model is learning. Imposed weight sparsity buys interpretability at a capacity cost; emergent update sparsity in RL is the trace of learning that works; activation sparsity is an adaptive response to the unknown. The thing worth knowing that you didn't know to ask: low *drift* from the base model, meaning the tuned model stays close to the base distribution as measured by KL, is what preserves the model's ability to keep learning new tasks, while parameter-only approaches stall when the domain shifts (see "Does staying close to the base model preserve learning ability?"). So the real risk to capability isn't sparse updating; it's losing plasticity by drifting too far, too densely.
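The source note on drift describes it in terms of KL, so a small sketch of that measure may help. The code below computes KL divergence between a tuned and a base distribution over the same discrete support. It assumes the distributions are next-token probabilities and that drift is taken as KL(tuned || base); neither detail is stated in the note, and all probability values are invented.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions on the same support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p = 0 contribute nothing
        if qi <= 0.0:
            raise ValueError("q must be positive wherever p is positive")
        total += pi * math.log(pi / qi)
    return total


# Invented toy distributions over three tokens.
base_probs = [0.5, 0.3, 0.2]
low_drift_probs = [0.48, 0.32, 0.2]   # stays close to the base model
high_drift_probs = [0.05, 0.05, 0.9]  # has moved far from the base model

no_drift = kl_divergence(base_probs, base_probs)
low_drift = kl_divergence(low_drift_probs, base_probs)
high_drift = kl_divergence(high_drift_probs, base_probs)
```

Under these assumptions, `low_drift` is close to zero and `high_drift` is far larger, which is the kind of gap the note associates with retained versus lost plasticity.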

Sources (6 notes)

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, an intrinsic sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
