Can one training example activate mathematical reasoning in RL-trained models?
This explores the surprising finding that a single training example can unlock math reasoning in models trained with reinforcement learning from verifiable rewards (RLVR) — and what that says about whether RL teaches reasoning or merely switches on ability the model already had.
The corpus has a striking direct answer: yes. A single example in RLVR can lift math performance from 36% to 73.6%, and, stranger still, test accuracy keeps climbing for 1,400 steps after training accuracy has already hit 100% (Can a single training example unlock mathematical reasoning?). That a lone example does so much is the tell: the model isn't learning math from that example. It's being switched on.
That reframing is the through-line across the collection. Several independent lines of work converge on the idea that base models already carry reasoning ability in latent form, and training merely elicits it. One synthesis catalogs five separate mechanisms — RL steering, critique fine-tuning, decoding tweaks, sparse-autoencoder feature steering, and RLVR — that all surface reasoning already sitting in base-model activations, concluding the bottleneck is elicitation, not capability (Do base models already contain hidden reasoning ability?). A related strand argues RL teaches *when* to reason rather than *how*: a hybrid model recovered 91% of the performance gains using only 12% of the tokens, suggesting RL acts as a deployment-timing optimizer (Does RL post-training create reasoning or just deploy it?, Does RL teach reasoning or just when to use it?). Pulled together, the 'one example' result stops looking like a fluke and starts looking like a prediction of the activation view (What does reward learning actually do to model reasoning?).
Here's the part you might not expect: if RL is mostly flipping a switch, then the *quality* of the reward signal should matter less than its existence, and that's exactly what shows up. Spurious rewards work nearly as well as correct ones for models with the right pretraining (What does reward learning actually do to model reasoning?). But that same finding carries a warning. On contaminated benchmarks, RLVR's apparent gains turn out to be memorization, not reasoning: Qwen2.5-Math-7B reconstructed 54.6% of MATH-500 from partial prompts yet scored 0.0% on the post-release LiveMathBench, and on clean data only genuinely correct rewards helped (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). So 'one example activates reasoning' and 'RLVR gains are just memorization' aren't contradictions; they describe what activation does and doesn't buy you.
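To make "reconstructed from partial prompts" concrete, here is a minimal sketch of a contamination probe. It rests on assumptions the note does not confirm: the `complete` callable stands in for a model's text-completion call (hypothetical), the split point is a fixed fraction of the words, and a hit is counted only when the completion begins with the exact withheld text. The cited work may score reconstruction differently.

```python
from typing import Callable, Iterable


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],  # hypothetical stand-in for a model call
    keep_fraction: float = 0.5,      # assumed split point, not from the source
) -> float:
    """Fraction of problems whose withheld tail the model reproduces verbatim."""
    hits = 0
    total = 0
    for text in problems:
        words = text.split()
        cut = int(len(words) * keep_fraction)
        prefix = " ".join(words[:cut])
        suffix = " ".join(words[cut:])
        if not prefix or not suffix:
            continue  # nothing to show or nothing to withhold
        # Normalize whitespace before comparing against the withheld tail.
        prediction = " ".join(complete(prefix).split())
        if prediction.startswith(suffix):
            hits += 1
        total += 1
    return hits / total if total else 0.0
```

A high rate on a benchmark released before the model's training cutoff, paired with a near-zero score on a benchmark released afterwards, is the pattern the contamination note describes.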
The corpus also pushes back on the tidy 'RL only elicits, never expands' story. Prolonged RL with KL control, policy resetting, and non-mathematical tasks can discover genuinely novel strategies that base models can't reach at any sampling budget — outperforming them across all pass@k levels (Can reinforcement learning discover reasoning strategies base models cannot?). And activation alone doesn't guarantee correctness: RLVR measurably improves the coherence between adjacent reasoning steps without making the overall proof valid — locally smooth, globally wrong (Does RLVR actually improve mathematical reasoning or just coherence?). The single example wakes the reasoning up; it doesn't make the reasoning true.
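The phrase "across all pass@k levels" is easier to weigh with the metric in hand. Below is a minimal sketch of the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them are correct. The note does not say which estimator the cited work used, so treat this as an assumption; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given that c of n drawn samples were correct."""
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_curve(n: int, c: int, ks: list[int]) -> list[float]:
    """Evaluate pass@k at several sampling budgets for one problem."""
    return [pass_at_k(n, c, k) for k in ks]
```

The elicitation-only view predicts that a base model's curve catches up with the RL model's as k grows, because more samples eventually surface the latent solution. A problem the base model never solves (c = 0) stays at zero for every k, which is the kind of case the capability-expansion result points to.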
If you want to follow the thread further, the corpus branches into how to make that activated reasoning useful: curriculum approaches that run imitation first to create rollouts worth sharpening (Does sequencing imitation then exploration training improve reasoning?), verifier-free reward signals for domains where answers can't be checked (Can reasoning improvement work without answer verification?), and the discovery that binary correctness rewards quietly wreck calibration by rewarding confident guessing (Does binary reward training hurt model calibration?). The one-example result is the door; behind it is a whole debate about whether we're growing reasoning or just learning to find the light switch.
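The calibration point can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the model states a confidence in [0, 1] alongside its answer and that the Brier penalty is subtracted from the correctness reward with equal weight; the exact weighting and the function names are assumptions, not taken from the cited note.

```python
def binary_reward(correct: bool) -> float:
    """Correctness-only reward: stated confidence has no effect."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2.
    Equal weighting of the two terms is an assumption of this sketch."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected Brier-augmented reward when the answer is right
    with probability p_correct and the model reports `confidence`."""
    return (
        p_correct * brier_augmented_reward(True, confidence)
        + (1.0 - p_correct) * brier_augmented_reward(False, confidence)
    )
```

Under `binary_reward`, a wrong answer earns 0 whether the model claimed 10% or 100% confidence, so overclaiming costs nothing. Under the augmented reward, a confidently wrong answer is penalized, and for a fixed chance of being right the expected reward peaks when the reported confidence equals that chance.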
Sources (11 notes)
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains through token-level routing alone, and activation vectors for reasoning strategies exist before any RL.
Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms either method in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts that the RL phase can then sharpen.
VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.