Can extended RL training unlock genuinely new reasoning strategies models cannot discover otherwise?

This explores whether reinforcement learning actually *creates* reasoning a model couldn't otherwise reach, or just surfaces and sharpens reasoning the base model already had — and the corpus is split down the middle on exactly that question.

The corpus genuinely disagrees with itself on this, which marks it as a live frontier rather than a settled fact. The optimistic camp says yes: prolonged RL on *diverse* tasks, especially non-mathematical ones where base models have no established patterns to fall back on, discovers novel strategies that beat the base model at every sampling budget, not just at small sample counts (see "Can reinforcement learning discover reasoning strategies base models cannot?"). The key word is *prolonged*, and the key conditions are KL control, policy resetting, and task diversity. Take those away and the story flips.

The skeptical camp runs a clever test: pass@k, the fraction of problems a model solves when given k sampled attempts per problem. If you let the base model sample k times, does it eventually solve everything the RL model solves? The answer, repeatedly, is that base models actually *win* at high k, meaning RL didn't expand the space of solvable problems; it concentrated probability onto solutions the base model could already find, so they surface more reliably (see "Does RLVR actually expand what models can reason about?"). Under this view, reward learning 'activates' pretraining strategies rather than teaching anything new, which is why a single training example, or even spurious or random rewards, can work nearly as well as correct ones for a model with the right pretraining (see "What does reward learning actually do to model reasoning?"). A whole cluster of evidence supports this 'elicitation, not acquisition' reading: five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, RLVR) all unlock reasoning that was sitting in the base model's activations the whole time (see "Do base models already contain hidden reasoning ability?"). The sharpest framing of all: RL teaches a model *when* to reason, not *how*. Hybrid models recover 91% of the gains just by routing tokens, and the activation vectors for reasoning strategies exist *before* any RL touches the weights (see "Does RL post-training create reasoning or just deploy it?").
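To make the pass@k argument concrete, here is a minimal Python sketch. It assumes the commonly used unbiased estimator (1 minus the chance that k draws from n samples, c of them correct, are all wrong); the notes above do not say which estimator the cited studies use. The per-problem success rates in the toy comparison are made up purely for illustration: a "sharpened" model that is far more reliable on two problems but has lost all probability mass on two others.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of one problem, c of which were correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def exact_pass_at_k(p: float, k: int) -> float:
    """pass@k when the true per-sample success probability p is known."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(probs: list[float], k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(exact_pass_at_k(p, k) for p in probs) / len(probs)


# Hypothetical per-problem success rates (illustrative only, not real data).
base = [0.30, 0.10, 0.02, 0.01]        # weak everywhere, but never zero
sharpened = [0.90, 0.60, 0.00, 0.00]   # strong on two problems, zero on two

for k in (1, 8, 64, 256):
    print(k, round(mean_pass_at_k(base, k), 3), round(mean_pass_at_k(sharpened, k), 3))
```

In this toy setup the sharpened model wins at k = 1 and the base model overtakes it at large k, which is the crossover pattern the skeptical camp reads as narrowing rather than boundary expansion. A model that truly expanded the boundary would instead stay ahead at every k.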

So how do these reconcile? The dividing line seems to be *novelty of domain* and *length of training*. RLVR studies that find no boundary expansion mostly test math, where base models are saturated with patterns — there's nothing new to discover, only sampling to sharpen. The study that does find genuine novelty deliberately moved off math into territory where the base model had no playbook. That suggests RL can't conjure reasoning from nothing, but in domains thin on pretraining coverage, 'sharpening what's latent' and 'discovering something new' may be the same act viewed from two angles.

If you want to follow the thread further, the corpus offers escape hatches around the whole RL framing. One line argues reasoning should be planted *earlier*: treat chain-of-thought as an exploratory action during pretraining itself, rewarded by information gain, rather than bolted on afterward (see "Can chain-of-thought reasoning be learned during pretraining itself?"). Another shows distillation genuinely transfers *new* reasoning patterns where RLVR doesn't, implying the way to get truly novel strategies into a model is to copy them from a stronger one, not reward-shape them in (see "Does RLVR actually expand what models can reason about?"). And a third bypasses the capability question entirely: modular 'cognitive tools' lifted GPT-4.1 on AIME 2024 competition math from 26.7% to 43.3% with *zero* RL, just by structurally isolating reasoning operations (see "Can modular cognitive tools unlock reasoning without training?"). The thing you might not have known you wanted to know: the same mechanism RL is supposed to install, extended 'thinking', actively *hurts* vanilla models by inducing self-doubt, and RL's real contribution may be less about new strategies and more about flipping that thinking from counterproductive to productive (see "Does extended thinking help or hurt model reasoning?").
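The information-gain idea in the pretraining line can be sketched in a few lines of Python. The source note describes the reward only as a log-likelihood improvement that needs no verifier, so everything below is an assumption for illustration: the function name, the per-token framing, and the choice of the same model's no-thought prediction as the baseline are all hypothetical, and the log-probabilities are invented numbers.

```python
def information_gain_reward(
    logp_with_thought: list[float],
    logp_without_thought: list[float],
) -> list[float]:
    """Per-token reward: how much a sampled thought improved the
    log-likelihood of the observed next tokens.

    Assumption: the baseline is the prediction made without the thought;
    the original method may define the baseline differently.
    """
    if len(logp_with_thought) != len(logp_without_thought):
        raise ValueError("both sequences must score the same tokens")
    return [w - wo for w, wo in zip(logp_with_thought, logp_without_thought)]


# Hypothetical log-probabilities of the same two target tokens.
rewards = information_gain_reward([-1.0, -2.0], [-1.5, -1.5])
print(rewards)  # first thought helped (+0.5), second hurt (-0.5)
```

A positive value means the thought made the real continuation more predictable, so no external answer checker is required; that is what makes the signal usable on ordinary pretraining text.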

Sources (8 notes)

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME 2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
