Does RL amplify existing reasoning or create genuinely new computational strategies?
This explores the live debate over whether reinforcement learning actually teaches models new ways to think or just makes them better at surfacing reasoning the base model already had, and what conditions tip the answer one way or the other.
The corpus is genuinely split on this question, which is the interesting part. The 'amplification' camp is large and specific. One line of work argues RL post-training teaches a model *when* to reason, not *how*: the reasoning strategies already exist as activation patterns before any RL training, and a hybrid model recovers 91% of the gains just by learning to route tokens (see "Does RL post-training create reasoning or just deploy it?"). The sampling-efficiency results push the same way: pass@k analysis shows base models actually *beat* RLVR-trained models at high k, meaning RL narrows sampling toward answers already in the base distribution rather than expanding the set of solvable problems (see "Does RLVR actually expand what models can reason about?"). The mechanics back this up unnervingly: a single training example can be enough to trigger the gains, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining, which only makes sense if the reward is *activating* a latent strategy, not teaching one (see "What does reward learning actually do to model reasoning?"). And when you stress-test the supposedly learned procedures on out-of-distribution variants, performance drops sharply, suggesting RL sharpened template-matching rather than installing a real algorithm (see "Do fine-tuned language models actually learn optimization procedures?").
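The pass@k comparison above can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is how many of them were correct; whether the cited analysis uses exactly this estimator is an assumption. The per-problem correct counts are invented purely to illustrate how a sharpened model can win at k=1 and still lose at high k. They are not results from any of the notes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-sample counts out of n=100 samples on four problems.
N_SAMPLES = 100
base_counts = [10, 5, 2, 1]  # base model: rarely right, but nonzero on every problem
rl_counts = [90, 80, 0, 0]   # RL-tuned model: reliable on two problems, never right on two

for k in (1, 8, 64):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_counts, N_SAMPLES, k)
    print(f"k={k:>2}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the RL-tuned model is far ahead at k=1, but the base model overtakes it at large k because it has nonzero coverage of every problem. That crossover is the pattern the amplification camp reads as narrowing rather than expansion.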
But the corpus doesn't let amplification win cleanly. The strongest counterpoint shows that *prolonged* RL (run long, with KL control, policy resetting, and crucially on non-mathematical tasks where the base model has no established patterns to fall back on) produces models that outperform the base across *all* pass@k levels, not just at low k (see "Can reinforcement learning discover reasoning strategies base models cannot?"). That last detail is the key to reconciling the two camps: the 'RL only amplifies' results mostly come from math and code, domains that already saturated the base model's pretraining. Push RL into territory the base model never mastered, and the boundary actually moves. So the answer may be less 'amplify vs. create' and more 'it depends on whether the base model already had the territory.'
What sharpens the whole debate is that you can elicit major reasoning gains with *no RL at all*. Wrapping four modular 'cognitive tools' around GPT-4.1 (sandboxed LLM calls that isolate reasoning operations) raised AIME 2024 performance from 26.7% to 43.3% with zero training (see "Can modular cognitive tools unlock reasoning without training?"). If structured scaffolding alone unlocks that much, a lot of what RL 'adds' really was sitting latent, waiting for the right scaffolding. The reward signal itself also turns out to be more flexible than the verification-heavy framing suggests: you can drive general-domain reasoning gains using just the likelihood of a reference answer, no verifier required (see "Can reasoning improvement work without answer verification?"), which again points at RL as a *director* of existing capability rather than a teacher of new capability.
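As a rough illustration of the verifier-free reward idea, the sketch below scores a reasoning trace by the probability a model assigns to the reference answer once the trace is in context. This is a minimal sketch under stated assumptions, not the method's actual implementation: the function name and the log-probability values are hypothetical, a real setup would read the per-token log-probs from the policy model, and the use of the same probability as a training weight is not shown.

```python
import math


def reference_likelihood_reward(answer_token_logprobs: list[float]) -> float:
    """Reward = P(reference answer | question, reasoning trace).

    The input is the log-probability of each reference-answer token,
    conditioned on the question and the sampled reasoning trace.
    Summing log-probs and exponentiating gives the sequence probability.
    """
    return math.exp(sum(answer_token_logprobs))


# Hypothetical log-probs for a two-token reference answer under two traces.
helpful_trace_logprobs = [-0.1, -0.2]    # trace makes the reference answer likely
unhelpful_trace_logprobs = [-2.0, -1.5]  # trace leaves the reference answer unlikely

helpful_reward = reference_likelihood_reward(helpful_trace_logprobs)
unhelpful_reward = reference_likelihood_reward(unhelpful_trace_logprobs)
print(f"helpful={helpful_reward:.3f}  unhelpful={unhelpful_reward:.3f}")
```

No answer checker appears anywhere in the loop: a trace is rewarded only to the extent that it makes the reference answer more probable.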
The thing you might not have known you wanted: the better question may not be 'amplify or create' but 'what *kind* of strategy.' Other notes show reasoning isn't one monolithic skill RL could simply expand. LLMs already run distinct strategic styles (minimax, trust-based, belief-anticipation) tied to task type (see "Do large language models use one reasoning style or many?"), creative reasoning splits into three paradigms that current methods ignore entirely (see "Can LLMs reason creatively beyond conventional problem-solving?"), and even where reasoning 'works' it tends to wander unsystematically rather than search (see "Why do reasoning LLMs fail at deeper problem solving?"). Against that backdrop, the genuinely new strategies RL *might* create, such as the structured breadth-first exploration that abstraction-guided training enforces (see "Can abstractions guide exploration better than depth alone?"), look less like raw capability gains and more like RL imposing *organization* on capabilities that were already there but firing chaotically.
Sources (11 notes)
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.
Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
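The exponential drop with problem depth described in the systematic-exploration note above can be illustrated with a toy model. This is a sketch under a simplifying assumption that is not taken from the note: each reasoning step succeeds independently with the same probability, so solving a depth-d problem requires d successes in a row. The function name and the 0.95 per-step value are hypothetical.

```python
def success_probability(per_step_success: float, depth: int) -> float:
    """Probability of completing `depth` steps when each succeeds independently."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be a probability")
    return per_step_success ** depth


# Hypothetical per-step reliability of 95% at shallow, medium, and deep problems.
for depth in (5, 20, 80):
    print(f"depth={depth:>2}  P(success)={success_probability(0.95, depth):.3f}")
```

Under this assumption even a high per-step reliability leaves shallow and medium problems reasonably solvable while deep ones become very unlikely, which matches the qualitative shape the note describes.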