INQUIRING LINE

Why does RL improve sampling efficiency but not expand capability boundaries?

This explores why RL post-training (especially RLVR — reinforcement learning with verifiable rewards) makes models more likely to find answers they could already reach, without teaching them to solve genuinely new problems.


The corpus converges on a surprisingly clean answer: RL is mostly a *redistribution* of probability mass, not the creation of new ability. The clearest evidence comes from pass@k analysis: given enough samples, a base model eventually solves problems the RL-trained model can't, so RL didn't add new solvable problems; it sharpened the model's aim toward solutions already present in the base distribution (see: Does RLVR actually expand what models can reason about?). RLVR "activates" pretraining strategies rather than instilling new reasoning, which is why a *single* training example can suffice and why spurious (even incorrect) rewards work nearly as well as correct ones for models with appropriate pretraining: the reasoning was already there to be switched on (see: What does reward learning actually do to model reasoning?).
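To make the pass@k argument concrete, here is a minimal Python sketch. It assumes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct; the source notes do not say which estimator the underlying analysis used. The per-problem counts are made up purely for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-sample counts out of N = 256 per problem (made-up data).
N = 256
base_counts = [2, 1, 3, 40]    # base model: rarely right, but never at zero
rl_counts = [0, 0, 200, 230]   # RL model: reliable on two problems, zero on the rest

for k in (1, 16, 256):
    base_score = mean_pass_at_k(base_counts, N, k)
    rl_score = mean_pass_at_k(rl_counts, N, k)
    print(f"k={k:>3}  base={base_score:.3f}  rl={rl_score:.3f}")
```

With these illustrative counts the RL-style model wins at k = 1 and the base-style model wins at k = 256. That crossover is the pattern the pass@k finding describes: better aim per sample, but no problem becomes solvable that was not already within reach.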

If RL is activation rather than instruction, the natural reframing is that it teaches *when* to reason, not *how*. Base models already carry reasoning strategies in latent form (activation vectors for these strategies exist before any RL touches the weights), and RL simply optimizes deployment timing. Strikingly, hybrid models recover 91% of the performance gains by routing tokens alone, suggesting most of the "improvement" is scheduling, not new skill (see: Does RL post-training create reasoning or just deploy it?).

The mechanics underneath reinforce this. Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are almost identical across random seeds: a structural, surgical nudge rather than a wholesale rewrite of what the model knows (see: Does reinforcement learning update only a small fraction of parameters?). At the distributional level, RL tends to *collapse* diversity: it amplifies one dominant pretraining format within the first epoch and suppresses the alternatives (see: Does RL training collapse format diversity in pretrained models?). That collapse is exactly what improves sampling efficiency, because it concentrates probability on a winning path, but it's also why the boundary doesn't move outward: you can't reach new territory by narrowing toward what you already favored.
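The narrowing argument can be illustrated with a toy reweighting. The sketch below assumes, purely for illustration, that RL behaves like an exponential tilt of the base distribution (each option's probability is multiplied by exp(reward / beta) and renormalized). The four "formats", their probabilities, the rewards, and beta are all hypothetical and not taken from the source notes.

```python
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def tilt(p, reward, beta):
    """Reweight p by exp(reward / beta) and renormalize (assumed RL-like update)."""
    weights = [x * math.exp(r / beta) for x, r in zip(p, reward)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical base distribution over four response formats.
# The last format has zero mass: the base model never produces it.
base = [0.5, 0.3, 0.2, 0.0]
# Hypothetical rewards: formats 0 and 3 would both score well.
reward = [1.0, 0.0, 0.0, 1.0]

tuned = tilt(base, reward, beta=0.2)

print("base :", [round(x, 4) for x in base], "entropy", round(entropy(base), 3))
print("tuned:", [round(x, 4) for x in tuned], "entropy", round(entropy(tuned), 3))
```

Under this assumption, mass piles onto the already favored rewarded format and entropy drops, while the rewarded format with zero base probability stays at zero. Reweighting can sharpen what the base model already samples, but it cannot assign mass to something the base model never produces.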

There's a sharp contrast worth knowing: distillation *does* expand boundaries, because it genuinely transfers new reasoning patterns from a stronger teacher (see: Does RLVR actually expand what models can reason about?). And pushing RL past activation into territory the base model can't reach backfires: training on near-impossible problems makes models learn degenerate shortcuts (answer repetition, computation-skipping) that then contaminate capabilities they previously had (see: Do overly hard RLVR samples actually harm model capabilities?). This is the boundary asserting itself: RL can't manufacture the missing reasoning, so it manufactures a cheat instead.
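The source note on overly hard samples attributes the shortcut problem to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. A minimal sketch of that effect follows. It assumes a common form of the normalization (reward minus the group mean, divided by the group standard deviation) with binary rewards. The group size of 16, the reward vectors, and the `eps` constant are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 16 rollouts with binary (verifiable) rewards.
near_impossible = [0.0] * 15 + [1.0]   # one accidental success
moderate = [1.0] * 8 + [0.0] * 8       # solved half the time

adv_hard = group_relative_advantages(near_impossible)
adv_moderate = group_relative_advantages(moderate)

print("advantage of the lone success on the hard problem:", round(max(adv_hard), 3))
print("advantage of a success on the moderate problem:", round(max(adv_moderate), 3))
```

Under these assumptions the single lucky rollout on the near-impossible problem receives a much larger advantage than any success on the moderate problem. Whatever produced that rollout, including a degenerate shortcut, is therefore reinforced strongly.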

The story isn't entirely settled, though, and that's the part worth lingering on. RL clearly *does* scale to long-horizon, multi-turn software tasks: modified DAPO training nearly doubled SWE-bench Verified performance on Qwen2.5-72B, from 20% to 39%, matching larger models (see: Can reinforcement learning scale beyond single-turn language tasks?). Whether that's true capability expansion or just very effective activation-plus-scheduling in a stateful environment is the open seam between these findings. The honest synthesis: RL is an extraordinarily efficient way to *elicit and aim* what pretraining already deposited, and the same narrowing that makes it efficient is precisely what keeps it inside the base model's walls.


Sources (7 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.
