Can a single correct example seed exponential improvement in mathematical reasoning?
This explores whether one correct example can trigger outsized gains in math reasoning — and the corpus suggests the gain is real but is better understood as *unlocking* latent ability than as compounding new skill from a single seed.
The most direct evidence says yes: a single training example in RLVR (reinforcement learning with verifiable rewards) lifted math accuracy from 36% to 73.6%, and, strikingly, test accuracy kept climbing for 1,400 steps after training accuracy had already hit 100% (Can a single training example unlock mathematical reasoning?). That 'post-saturation generalization' is the closest thing here to your 'exponential' intuition: improvement that continues long after the model has nothing left to memorize from the example itself. A parallel result reaches the same place by a different road: fine-tuning on critiques of one problem's varied solutions unlocks comparable reasoning, no reinforcement learning needed (Can a single problem unlock reasoning through solution critique?).
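To make the headline numbers concrete, the sketch below works out the absolute and relative gain from the figures quoted above, and shows one way post-saturation generalization could be measured from a training log. The function names `gain` and `post_saturation_steps` and the values in `toy_log` are hypothetical, invented for illustration; they are not data or code from the paper.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative multiplier)."""
    return round(after - before, 1), round(after / before, 2)


def post_saturation_steps(log: list[tuple[int, float, float]]) -> int:
    """Steps between training accuracy first reaching 1.0 and the later
    step at which test accuracy peaks.

    `log` holds (step, train_acc, test_acc) tuples in step order.
    Returns 0 if training accuracy never saturates.
    """
    saturated = [row for row in log if row[1] >= 1.0]
    if not saturated:
        return 0
    first_step = saturated[0][0]
    # Only consider the log from the saturation point onward.
    tail = [row for row in log if row[0] >= first_step]
    best = max(tail, key=lambda row: row[2])
    return best[0] - first_step


# Figures quoted in the text: 36% -> 73.6%.
abs_gain, rel_gain = gain(36.0, 73.6)

# Hypothetical log, invented for illustration only (not the paper's data).
toy_log = [
    (0, 0.0, 0.36),
    (100, 1.0, 0.50),
    (800, 1.0, 0.65),
    (1500, 1.0, 0.73),
    (2000, 1.0, 0.72),
]
lag = post_saturation_steps(toy_log)
```

The quoted jump is 37.6 percentage points, or roughly a 2.04x multiplier: large, but a one-off doubling rather than compounding growth.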
But read the two together and the word that fits better than 'seed' is *activation*. Both papers frame the single example as a trigger for capability the base model already had, not as material the model learns from. The example isn't teaching multiplication — it's flipping a switch on a circuit that was already wired. This matters because it predicts where the trick stops working: you can only activate what's latent. A model that genuinely couldn't do the math wouldn't generalize for 1,400 steps from one problem.
The corpus then gets skeptical in a useful way. Several notes argue that big benchmark jumps can be mirages. RLVR's gains on contaminated benchmarks turn out to be largely memorization: Qwen2.5-Math-7B reconstructed 54.6% of MATH-500 from partial prompts yet scored 0.0% on the post-release LiveMathBench. Notably, on clean benchmarks *only correct rewards* improved performance, while random or inverse rewards did nothing or hurt (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). So the 'single correct example' result survives this critique on one count (correctness of the signal does matter) but should make you ask whether the headline number reflects reasoning or leakage. Relatedly, RLVR can improve the *coherence* of reasoning traces, with fewer logical errors between adjacent steps, without making the overall proof valid (Does RLVR actually improve mathematical reasoning or just coherence?), and supervised fine-tuning can raise answer accuracy while *cutting* the information gain of the inferential steps by 38.9%, producing right answers through post-hoc rationalization (Does supervised fine-tuning improve reasoning or just answers?).
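The partial-prompt contamination check mentioned above can be sketched roughly as follows: show the model the start of a benchmark problem and see whether it reproduces the rest. This is a minimal sketch under stated assumptions, not the cited study's procedure. The `complete` callable is a hypothetical stand-in for a model call, and both the 0.6 prefix fraction and the exact-match criterion are assumptions chosen for simplicity.

```python
from typing import Callable


def reconstruction_rate(
    problems: list[str],
    complete: Callable[[str], str],
    prefix_fraction: float = 0.6,  # assumed split point, not from the source
) -> float:
    """Fraction of problems whose held-back tail the model reproduces exactly.

    `complete` is a hypothetical stand-in for a model call: it takes a
    prompt prefix and returns the model's continuation.
    """
    if not problems:
        return 0.0
    hits = 0
    for text in problems:
        cut = int(len(text) * prefix_fraction)
        prefix, tail = text[:cut], text[cut:]
        # Exact match on the tail is an assumed, deliberately strict criterion.
        if complete(prefix).strip() == tail.strip():
            hits += 1
    return hits / len(problems)


# Toy stand-in for a model that has memorized two of three problems.
memorized = {
    "Find all real x with x^2 - 5x + 6 = 0.",
    "Compute the sum of the first 100 positive integers.",
}


def toy_complete(prefix: str) -> str:
    """Return the memorized remainder if the prefix matches, else nothing."""
    for text in memorized:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ""


benchmark = sorted(memorized) + ["How many primes are below 50?"]
rate = reconstruction_rate(benchmark, toy_complete)
```

A high rate on a benchmark released before the model's training cutoff, paired with a near-zero score on a post-release benchmark, is the pattern the note reads as leakage rather than reasoning.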
Here's the part you didn't know you wanted to know: a cluster of findings suggests reasoning training often teaches *form*, not inference. Logically invalid chain-of-thought exemplars matched valid ones on BIG-Bench Hard (Does logical validity actually drive chain-of-thought gains?), and models trained on systematically irrelevant traces keep their solution accuracy, sometimes generalizing *better* out of distribution (Do reasoning traces need to be semantically correct?). Traces seem to act as computational scaffolding rather than meaningful steps. This sits in productive tension with the single-example result, which insists the example must be *correct*. The reconciliation: when you supply many traces, the model latches onto their shape; when you supply one activating signal under verifiable reward, correctness is what tells the model which latent behavior to switch on.
So the honest answer to 'exponential improvement from one example' is: real, large, and continuing past saturation, but it's the discharge of stored potential, not compounding growth from scratch, and it's bounded by what the base model already contains. If you want to go further on those limits, the corpus also notes that more reasoning isn't free: accuracy peaks and then *declines* past a thinking-token threshold (Does more thinking time always improve reasoning accuracy?). It also notes that knowledge and reasoning live in different network layers, which is why activating reasoning helps math but can quietly damage knowledge-heavy domains (Why does reasoning training help math but hurt medical tasks?).
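The thinking-token tradeoff is easy to check against an accuracy curve. The sketch below uses only the two approximate endpoints reported in the source notes (about 1,100 and about 16K thinking tokens, at 87.3% and 70.3% accuracy). With two points the "peak" is simply the better of the two, so a real analysis would need the intermediate budgets. The function name `peak_and_drop` is hypothetical.

```python
def peak_and_drop(curve: list[tuple[int, float]]) -> tuple[int, float]:
    """Return (token budget at peak accuracy, points lost by the largest budget).

    `curve` holds (thinking_tokens, accuracy_percent) pairs in any order.
    """
    ordered = sorted(curve)  # sort by token budget
    peak_tokens, peak_acc = max(ordered, key=lambda point: point[1])
    final_acc = ordered[-1][1]  # accuracy at the largest budget
    return peak_tokens, round(peak_acc - final_acc, 1)


# The two endpoints reported in the source notes (token counts are approximate).
reported = [(1_100, 87.3), (16_000, 70.3)]
peak_tokens, drop = peak_and_drop(reported)
```

On these figures the larger budget costs 17.0 percentage points relative to the smaller one.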
Sources (9 notes)
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This demonstrates that exposure to correct versus incorrect reasoning on a specific problem is the sufficient activation signal.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.