INQUIRING LINE

How do RL subnetworks identified from different random seeds compare?

This explores what happens when you train the same model with RL under different random seeds — do you get the same subset of updated parameters, or different ones each time?


The question is whether RL's parameter updates are arbitrary or structural: if you reran training with a different random seed, would a different slice of the network change? The corpus has a direct and striking answer: the subnetworks are *nearly identical* across seeds. Across seven RL algorithms and ten model families, RL updates only 5–30% of parameters, and it picks almost the same ones every time, regardless of the seed (see "Does reinforcement learning update only a small fraction of parameters?"). That consistency is the whole point. It means RL isn't randomly nudging whatever weights happen to be in its path; it's repeatedly finding the same functional circuit. (Notably, these updates are also nearly full-rank, so this isn't the low-rank story you might expect from LoRA-style adaptation. The change is sparse but rich.)
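To make "the same subset of parameters" concrete, here is a minimal sketch of how one might compare the updated subnetworks from two seeds. It is an illustration under stated assumptions, not the method used in the underlying note: the function names, the `tol` threshold, the one-sided overlap metric, and the toy weight values are all hypothetical.

```python
def update_mask(before, after, tol=0.0):
    """Mark each parameter whose value moved by more than tol (assumed threshold)."""
    return [abs(a - b) > tol for b, a in zip(before, after)]


def update_fraction(mask):
    """Fraction of parameters that were updated at all."""
    return sum(mask) / len(mask)


def overlap(mask_a, mask_b):
    """Of the parameters updated in run A, the fraction also updated in run B."""
    updated_a = sum(mask_a)
    if updated_a == 0:
        return 0.0
    both = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
    return both / updated_a


# Toy flattened weights (made up for illustration): one base model, two RL runs.
base   = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
seed_1 = [0.10, -0.25, 0.30, 0.40, -0.50, 0.66, 0.70, -0.80, 0.90, 1.00]
seed_2 = [0.10, -0.23, 0.30, 0.40, -0.50, 0.61, 0.70, -0.80, 0.95, 1.00]

mask_1 = update_mask(base, seed_1)  # positions 1 and 5 changed
mask_2 = update_mask(base, seed_2)  # positions 1, 5 and 8 changed

print("updated fraction, seed 1:", update_fraction(mask_1))
print("updated fraction, seed 2:", update_fraction(mask_2))
print("overlap seed 1 -> seed 2:", overlap(mask_1, mask_2))
print("overlap seed 2 -> seed 1:", overlap(mask_2, mask_1))
```

The overlap is computed in both directions because the two masks can differ in size. A high value both ways is what "nearly identical across seeds" would look like under this metric. Whether two runs changed the same positions and whether they changed them by the same amounts are separate questions, and this sketch only addresses the first.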

What makes this interesting is that the same 'convergence' signature shows up in the other RL notes in this collection. When you train with RL, the model doesn't just settle on the same parameters; it settles on the same *behavior*. RL post-training consistently amplifies a single dominant text format inherited from pretraining and suppresses the alternatives within the first epoch (see "Does RL training collapse format diversity in pretrained models?"). So the consistency isn't only in which weights move; it's in where the whole policy lands.

That convergence has a cost, and the collection is candid about it. Outcome-based RL sharpens the policy globally, concentrating probability on correct trajectories while bleeding diversity even on problems it hasn't solved yet (see "Does outcome-based RL diversity loss spread across unsolved problems?"). The same entropy-collapse mechanism that narrows reasoning also narrows search agents, which converge on a few reward-maximizing strategies and lose exploration breadth (see "Does reinforcement learning squeeze exploration diversity in search agents?"). Put together, a picture emerges: RL is a funnel. Different seeds don't lead to different destinations. They lead to the same sparse subnetwork, the same dominant format, the same narrowed behavior.
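The "funnel" can be illustrated with Shannon entropy over a handful of candidate strategies. The sketch below is a toy, not RL training: it assumes that sharpening can be mimicked by raising each probability to a power and renormalizing, and the `sharpen` helper, the exponent, and the starting distribution are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sharpen(probs, power):
    """Toy stand-in for policy sharpening: exponentiate, then renormalize."""
    powered = [p ** power for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Made-up probabilities over four strategies before any sharpening.
before = [0.4, 0.3, 0.2, 0.1]
after = sharpen(before, 4.0)

print("entropy before:", entropy(before))
print("entropy after: ", entropy(after))
print("top strategy share before/after:", before[0], after[0])
```

Under this toy model the leading strategy absorbs most of the probability mass and entropy falls, which is the shape of the diversity loss the notes describe. It says nothing about how quickly or how far a real policy collapses.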

The thing you might not have known you wanted to know: this seed-invariance is arguably good news for interpretability and editing. If RL reliably localizes its changes to the same small, structured region of the network, that region becomes a target, something you could inspect, prune, or transplant, rather than a diffuse fog spread unpredictably across billions of weights. The flip side is that the reliability that makes RL editable is the same force that collapses diversity, which is why the corpus points to SFT on varied demonstrations as the counterweight when breadth matters (see "Does reinforcement learning squeeze exploration diversity in search agents?").


Sources (4 notes)

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, an intrinsic sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Next inquiring lines