Why does prolonged RL discover strategies absent from any base model sample?

This explores a genuine fight in the corpus: whether reinforcement learning can invent reasoning the base model never had, or whether it only sharpens sampling of strategies already latent inside — and what conditions tip it one way or the other.

The question is contested, not settled. One camp says RL invents nothing: it just gets better at fishing solutions out of a distribution the base model already contains. The sharpest version of that claim comes from pass@k analysis: at high k, base models actually *beat* their RL-trained versions, which means RL narrowed the search toward known answers rather than widening what's solvable at all (see "Does RLVR actually expand what models can reason about?"). The same picture appears in the finding that a single training example, or even a *spurious* reward, can trigger most of the gain in models with appropriate pretraining, which is a signature of activation, not teaching (see "What does reward learning actually do to model reasoning?"). A cleaner framing still: RL teaches a model *when* to reason, not *how*. Hybrid models recover 91% of the gains just by routing tokens, and the activation vectors for reasoning strategies exist before any RL touches the weights (see "Does RL post-training create reasoning or just deploy it?").
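To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The two-problem benchmark and all the counts are invented for illustration and do not come from the cited notes; they only show how an RL-tuned model can win at k = 1 and still lose at high k.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n samples contains no correct sample at all.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts out of 256 samples on two problems (easy, hard).
N = 256
base_counts = [40, 2]   # base model: weak on the easy one, rarely right on the hard one
rl_counts = [200, 0]    # RL-tuned: strong on the easy one, never samples the hard solution

for k in (1, 16, 256):
    print(k, mean_pass_at_k(base_counts, N, k), mean_pass_at_k(rl_counts, N, k))
```

In this toy setup the RL-tuned model leads at small k because it concentrates probability on answers it already finds, while the base model overtakes it at large k because its rare correct samples on the hard problem eventually get drawn.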

So why does the opposite result, strategies absent from *any* base sample, keep showing up? The reconciling note is that capability creation is **domain-conditional** (see "Does reinforcement learning create new reasoning abilities or activate existing ones?"). On standard reasoning, where the base model has seen the patterns, RL only activates what's latent. But on complex multi-step planning, where no established pattern exists to sample, RL generates genuinely novel strategies the base model can't reach even with extensive sampling. The 'prolonged RL' result lands here: trained long enough, on *diverse and non-mathematical* tasks, with KL control and policy resetting, RL-trained models win across *all* pass@k levels, which is the signature of an expanded boundary rather than a narrowed one (see "Can reinforcement learning discover reasoning strategies base models cannot?"). The disagreement between the two camps is largely a disagreement about which domains they tested.
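The notes name "KL control and policy resetting" without giving a recipe, so the following is only a sketch of one common way such a setup is written. The assumptions are loud ones: the penalty is a per-sequence log-ratio estimate of KL against a frozen reference, "policy resetting" is read here as periodically replacing that reference with the current policy, and `beta` and `reset_every` are hypothetical knobs with made-up defaults.

```python
def kl_shaped_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.01):
    """Task reward minus a KL penalty estimated on the sampled tokens."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Summed log-ratio over the sampled tokens: a single-sample estimate
    # of KL(policy || reference) for this sequence.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_estimate

def maybe_reset_reference(step, policy_params, ref_params, reset_every=1000):
    """Assumed reading of 'policy resetting': re-anchor the reference periodically."""
    if step > 0 and step % reset_every == 0:
        # Copy so later policy updates do not silently move the reference.
        return dict(policy_params)
    return ref_params

# A sequence the policy now prefers more than the reference does gets penalized.
print(kl_shaped_reward(1.0, [-1.0, -0.5], [-2.0, -1.5], beta=0.1))
```

The design intuition is that the penalty keeps each stretch of training close to its anchor, while re-anchoring lets a long run keep moving instead of being held at the original base forever.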

What's quietly fascinating is *why* the length of training matters, and a two-phase dynamic explains it. Early in training, RL is busy consolidating procedural execution: getting steps correct. Only in a second phase does strategic planning become the bottleneck, with planning-token entropy *rising* while execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). Novel strategy, in other words, is a late-training phenomenon: you can't reach the exploration phase without paying for the consolidation phase first. Short runs never get there, which is part of why the 'RL discovers nothing' studies and the 'RL discovers new strategies' studies disagree: they may be sampling different points on the same curve.
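The entropy signal behind the two-phase picture can be sketched as follows: compute the Shannon entropy of each next-token distribution, then average it separately for planning and execution tokens. How tokens get labeled as "planning" or "execution" is not specified in the notes, so the `roles` list below is a hypothetical labeling, and the distributions are toy numbers.

```python
from math import log

def entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return sum(-p * log(p) for p in probs if p > 0.0)

def mean_entropy_by_role(token_dists, roles):
    """Average per-token entropy separately for each role label."""
    totals, counts = {}, {}
    for dist, role in zip(token_dists, roles):
        totals[role] = totals.get(role, 0.0) + entropy(dist)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}

# Toy distributions over a 4-token vocabulary at four positions.
dists = [
    [0.25, 0.25, 0.25, 0.25],  # undecided between strategies
    [1.0, 0.0, 0.0, 0.0],      # deterministic step
    [0.5, 0.5, 0.0, 0.0],      # two candidate plans
    [1.0, 0.0, 0.0, 0.0],      # deterministic step
]
roles = ["planning", "execution", "planning", "execution"]  # hypothetical labels

stats = mean_entropy_by_role(dists, roles)
print(stats)
```

Tracked across checkpoints, the pattern described above would show up as the planning average climbing while the execution average flattens out.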

There's a structural reason this is even possible without scrambling the model. RL updates only 5–30% of parameters, and those sparse updates are nearly full-rank and nearly identical across random seeds, meaning the model is making a *structured*, targeted edit, not a diffuse one (see "Does reinforcement learning update only a small fraction of parameters?"). Staying close to the base distribution turns out to be load-bearing: low KL drift preserves the plasticity needed to keep learning, while parameter-only methods that drift hard simply stall when the domain shifts (see "Does staying close to the base model preserve learning ability?"). So 'prolonged' discovery isn't brute-force divergence from the base model. It's a long, narrow, stable walk that keeps the base intact while carving new planning behavior on top.
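One simple way to measure the "only a fraction of parameters move" claim is to diff two checkpoints and count the entries that changed. This is a sketch under assumptions: checkpoints are represented as plain dicts of flat float lists, the parameter names and values are invented, and the `tol` threshold is a hypothetical choice. It does not check the full-rank or cross-seed properties mentioned above.

```python
def update_fraction(base, tuned, tol=0.0):
    """Fraction of scalar parameters whose value moved by more than tol."""
    changed = 0
    total = 0
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for {name}")
        for b, t in zip(base_values, tuned_values):
            total += 1
            if abs(t - b) > tol:
                changed += 1
    return changed / total

# Toy checkpoints: 8 scalars, 2 of which differ after "RL".
base_ckpt = {"attn.w": [0.1, -0.2, 0.3, 0.4], "mlp.w": [0.5, 0.6, -0.7, 0.8]}
rl_ckpt = {"attn.w": [0.1, -0.2, 0.35, 0.4], "mlp.w": [0.5, 0.6, -0.7, 0.9]}

print(update_fraction(base_ckpt, rl_ckpt))
```

On real checkpoints the same count would be run per tensor with an array library, and the tolerance matters because low-precision formats can hide or exaggerate tiny changes.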

The catch worth knowing: this discovery is fragile and cuts against diversity. The same RL that finds new planning strategies also *compresses* behavioral diversity through entropy collapse: policies converge on narrow reward-maximizing paths, the same way they do in search agents, where SFT on diverse demonstrations is what preserves exploration breadth (see "Does reinforcement learning squeeze exploration diversity in search agents?"). Push it with problems that are too hard and it doesn't discover at all; it learns degenerate shortcuts that then contaminate abilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So the honest answer to the question is that prolonged RL discovers absent strategies only in a narrow regime: hard-but-tractable planning domains, training long enough to reach the exploration phase, with the base model held close enough to stay plastic. Outside that regime, what looks like discovery is either activation of the latent, or active damage.
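The note on overly hard samples attributes the damage to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. A minimal sketch of that arithmetic, assuming the common mean-and-standard-deviation form of group normalization (the exact variant used in the cited work is not stated here, and the reward groups are made up):

```python
from math import sqrt

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Tractable problem: half of the 8 samples succeed.
tractable = group_relative_advantages([1, 0, 1, 0, 1, 0, 1, 0])

# Nearly impossible problem: 1 lucky success among 16 samples.
too_hard = group_relative_advantages([0] * 15 + [1])

print(max(tractable), max(too_hard))
```

The lone success in the hard group ends up with an advantage several times larger than any success in the tractable group, so whatever produced it, including a shortcut such as repeating an answer or skipping computation, is reinforced strongly.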

Sources (10 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
