INQUIRING LINE

Can one training example activate mathematical reasoning without reinforcement learning?

This explores a real tension in the question: the famous 'one training example' result actually comes from RLVR (a form of reinforcement learning), so the honest answer is that the corpus shows minimal *signals* activate latent math reasoning — and that several non-RL methods can do the same eliciting job.


Can a single example switch on mathematical reasoning, and is reinforcement learning the necessary ingredient? The cleanest place to start is also where the question gets complicated: the headline 'one example' finding (Can a single training example unlock mathematical reasoning?) happens *inside* RLVR. One example lifts math accuracy from 36% to 73.6%, and test accuracy keeps climbing for 1,400 steps after training accuracy hits 100%. So strictly, that specific result uses RL. But the reason it works suggests RL isn't doing what people assume.

The corpus reframes RL as an *activator*, not a teacher. Reward learning improves how efficiently a model samples from strategies it already has, rather than installing new ones (What does reward learning actually do to model reasoning?). That is exactly why a single example, or even a spurious reward, is enough: there's little to teach, only something to switch on. Pushed further, multiple independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering) all elicit reasoning that was already latent in base-model activations, suggesting post-training *selects* reasoning rather than creating it (Do base models already contain hidden reasoning ability?). A complementary framing: RL teaches *when* to deploy reasoning, not *how* to reason, with one hybrid setup recovering 91% of the gains using 12% of the tokens (Does RL teach reasoning or just when to use it?).
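One way to make the 'sampling efficiency' claim concrete is pass@k, the probability that at least one of k sampled answers is correct. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The sample counts are invented for illustration and do not come from any of the cited notes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts (assumption): out of 200 samples on one problem,
# a "base" model is right 10 times and a "tuned" model 100 times.
N_SAMPLES = 200
BASE_CORRECT, TUNED_CORRECT = 10, 100

for k in (1, 8, 64):
    base = pass_at_k(N_SAMPLES, BASE_CORRECT, k)
    tuned = pass_at_k(N_SAMPLES, TUNED_CORRECT, k)
    print(f"k={k:>2}  base={base:.3f}  tuned={tuned:.3f}")
```

With these made-up counts the two models differ tenfold at k=1 (0.05 against 0.50), yet both exceed 0.95 at k=64. That is the pattern the 'activator' reading predicts: the correct strategy is already reachable by sampling, and tuning mostly makes it likelier on the first try.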

If the capability is latent, then the 'without RL' half of your question has a satisfying answer: yes. Four modular 'cognitive tools' implemented as sandboxed LLM calls lifted GPT-4.1 on AIME2024 from 26.7% to 43.3% with *no RL training at all*; structured isolation elicited reasoning that pure prompting couldn't reliably reach (Can modular cognitive tools unlock reasoning without training?). Activation steering is cheaper still: a single direction extracted from 50 paired examples cuts chain-of-thought length by 67% while maintaining accuracy, with no training at all (Can we steer reasoning toward brevity without retraining?). And energy-based transformers reach 'System 2' deliberation from unsupervised learning alone, with no domain-specific reward scaffolding (Can energy minimization unlock reasoning without domain-specific training?). These are different doorways into the same room: the bottleneck is elicitation, not acquisition.
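As a rough illustration of the activation-steering idea, the sketch below builds a single direction as the mean difference between paired activation vectors and adds a scaled copy of it to a hidden state. The difference-of-means recipe is a common choice and an assumption here; the note does not say how its vector is extracted. The toy activations, the function names and the `alpha` strength parameter are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]

def steering_vector(pairs: Sequence[Tuple[Vector, Vector]]) -> Vector:
    """Mean of (target - source) over paired activation vectors."""
    if not pairs:
        raise ValueError("need at least one pair")
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for source, target in pairs:
        if len(source) != dim or len(target) != dim:
            raise ValueError("all activations must share one dimension")
        for i in range(dim):
            total[i] += target[i] - source[i]
    return [t / len(pairs) for t in total]

def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the direction; alpha sets the strength."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 3-dimensional "activations" (assumption); a real setup would read
# these from one layer of a model for each (source, target) example pair.
toy_pairs = [
    ([1.0, 0.0, 2.0], [0.0, 0.0, 3.0]),
    ([2.0, 1.0, 0.0], [1.0, 1.0, 1.0]),
]
direction = steering_vector(toy_pairs)
steered = steer([0.5, 0.5, 0.5], direction, alpha=2.0)
print(direction)  # [-1.0, 0.0, 1.0]
print(steered)    # [-1.5, 0.5, 2.5]
```

No weights change in this procedure, which is why such methods count as training-free: the only computation is an average over a small set of pairs and an addition at inference time.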

The dissent is worth knowing about. One line of work argues that prolonged RL on diverse tasks, with KL control and policy resetting, discovers genuinely *novel* strategies that base models can't reach, beating them at every pass@k level (Can reinforcement learning discover reasoning strategies base models cannot?). So the field hasn't fully settled whether RL only unlocks or can sometimes create. The takeaway you didn't know you wanted: the surprise of 'one example is enough' isn't really about the example. It's evidence that the math reasoning was sitting in the model the whole time, and RL is just one of several keys that happen to fit the lock.
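The 'KL control' in that setup can be pictured as a penalty that keeps the trained policy close to a reference policy. The sketch below is a minimal version over a categorical distribution, assuming the common form of reward minus beta times KL(policy || reference). The cited work's exact objective is not given in the note, and `beta` and the toy distributions are hypothetical.

```python
from math import log
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two categorical distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return float("inf")  # p puts mass where q has none
            total += pi * log(pi / qi)
    return total

def kl_penalized_reward(
    reward: float,
    policy: Sequence[float],
    reference: Sequence[float],
    beta: float = 0.1,  # hypothetical penalty weight
) -> float:
    """Task reward minus a penalty for drifting from the reference policy."""
    return reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.5]  # toy reference policy over two actions
drifted = [0.9, 0.1]    # toy policy that has moved away from it
print(kl_penalized_reward(1.0, reference, reference))  # no drift, no penalty
print(kl_penalized_reward(1.0, drifted, reference))    # slightly below 1.0
```

The larger `beta` is, the harder it becomes for training to move the policy away from what the reference already does, which is why the strength of this constraint matters to the unlock-versus-create debate.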


Sources (8 notes)

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
