Can RLVR expand a model's reasoning capabilities beyond its training ceiling?

This note asks whether RLVR (reinforcement learning with verifiable rewards) teaches a model to reason about things it could not handle before, or whether it only gets better at surfacing reasoning the base model already had.

The question is whether RLVR can push a model past its training ceiling or whether it only sharpens what's already inside. The corpus leans hard toward the second answer, with one important crack in the consensus worth knowing about. The dominant finding is that RLVR doesn't expand the set of problems a model can solve; it narrows the model's sampling toward solutions already present in the base distribution. The cleanest evidence is the pass@k test: at high k (many sampling attempts per problem), base models match or beat their RLVR-trained versions, which means RLVR isn't unlocking new solvable problems. It's making the model land on existing solutions in fewer attempts (see "Does RLVR actually expand what models can reason about?"). A striking corollary: a single training example can trigger the effect, and even spurious (wrong) rewards work nearly as well as correct ones, as long as the model has appropriate pretraining (see "What does reward learning actually do to model reasoning?").
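To make the pass@k argument concrete, here is a minimal sketch. It assumes the standard unbiased estimator from the code-generation literature, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The notes do not say which estimator the underlying papers use, and the per-problem counts below are invented purely to illustrate the crossover.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: samples drawn for a problem, c: how many were correct, k: attempt budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-sample counts per problem (made up, not from any paper).
N = 256
BASE_CORRECT = [8, 4, 2, 1]      # base model: solves every problem, but rarely
RLVR_CORRECT = [200, 150, 0, 0]  # tuned model: reliable on two, never on two

for k in (1, 16, 256):
    base = mean_pass_at_k(BASE_CORRECT, N, k)
    rlvr = mean_pass_at_k(RLVR_CORRECT, N, k)
    print(f"k={k:>3}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these toy counts the tuned model is far ahead at k=1, while the base model overtakes it at large k because it has nonzero success on every problem. That is the shape of the result the pass@k note describes, not a reproduction of it.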

The deeper reframe across these notes is that capability was never the bottleneck; elicitation was. Base models already carry reasoning in latent form, and at least five independent methods (RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, and RLVR) all reach in and pull out the same pre-existing patterns (see "Do base models already contain hidden reasoning ability?"). One framing puts it sharply: RL teaches a model *when* to reason, not *how*. Hybrid models recover 91% of the gains by just routing tokens, and the reasoning activation vectors exist before any RL touches the weights (see "Does RL post-training create reasoning or just deploy it?"). You can even get big jumps with no training at all: modular 'cognitive tools', implemented as sandboxed LLM calls that isolate individual reasoning operations, lifted GPT-4.1 on AIME2024 from 26.7% to 43.3% (see "Can modular cognitive tools unlock reasoning without training?").

There's also a darker side to RLVR that cuts against the 'expansion' story entirely: it can actively shrink what the model is able to solve. Because RLVR stays on-policy and favors exploitation over exploration, it can trigger 'capability boundary collapse,' where the model's problem-solving scope narrows rather than grows (see "Why does RLVR training narrow a model's problem solving ability?"). Feed it problems that are too hard and it learns degenerate shortcuts, such as answer repetition and skipping computation, that then contaminate skills it already had (see "Do overly hard RLVR samples actually harm model capabilities?"). And the improvements that do show up may be shallower than they look: RLVR makes reasoning traces locally more coherent (fewer errors between adjacent steps) without guaranteeing the overall proof is valid (see "Does RLVR actually improve mathematical reasoning or just coherence?").
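The source note on overly hard samples attributes the shortcut problem to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. A minimal sketch of why that happens, assuming the GRPO-style form (reward minus group mean, divided by group standard deviation). The function name, the group size of 16, and the binary rewards are illustrative assumptions, not details taken from the notes.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 16 rollouts with binary (verifiable) rewards.
near_impossible = [1.0] + [0.0] * 15     # one lucky success out of 16
well_matched = [1.0] * 8 + [0.0] * 8     # half the rollouts succeed

adv_rare = group_relative_advantages(near_impossible)
adv_balanced = group_relative_advantages(well_matched)

# The single lucky rollout gets a much larger push than any rollout in the
# well-matched group, whether or not its reasoning was actually sound.
print(f"advantage of lone success:     {adv_rare[0]:.2f}")
print(f"advantage of a typical success: {adv_balanced[0]:.2f}")
```

Under these assumptions the lone success receives several times the advantage of a success in the balanced group. If that rollout reached the right answer through repetition or skipped computation, the update reinforces exactly that behavior.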

Here's the crack in the consensus, and it's the thing you might not have known to look for. The narrowing result isn't a law of nature; it's a consequence of *how* RLVR is usually run. One note shows that with KL control, policy resetting, and, crucially, training on *diverse, non-mathematical* tasks, prolonged RL beats base models across all pass@k levels, which means it is genuinely expanding the boundary rather than just optimizing sampling (see "Can reinforcement learning discover reasoning strategies base models cannot?"). The pattern: expansion seems possible exactly in domains where the base model has *no* established reasoning patterns to fall back on, while the 'just elicitation' finding holds in saturated domains like competition math. A related note adds that explicitly rewarding exploration of underused reasoning paths can counteract boundary collapse (see "Why does RLVR training narrow a model's problem solving ability?").

So the honest synthesis: standard RLVR, as most papers measure it, is a deployment optimizer dressed up as a capability creator. It activates and routes latent skill, and can even erode capability if pushed badly. But that ceiling appears to be set by the training recipe and the domain, not by RLVR as a method. Change the exploration pressure and the task diversity, and the ceiling moves. If you want the cleanest version of the 'no expansion' case, start with the pass@k analysis ("Does RLVR actually expand what models can reason about?"); if you want the counterexample that complicates it, go to the prolonged-RL result ("Can reinforcement learning discover reasoning strategies base models cannot?").

Sources (9 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes when that capability is deployed rather than creating it. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL training.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
