Why do random rewards improve reasoning for some models but not others?
Spurious rewards boost Qwen's math reasoning by 16-25% but fail for Llama and OLMo. We explore whether reward quality matters, or if pretraining strategy determines what RLVR can unlock.
RLVR improves MATH-500 performance for Qwen2.5-Math-7B by 21.4% with random rewards, 16.4% with format-only rewards, 24.6% with incorrect labels, and 24.4% with 1-shot RL — nearly matching the 28.8% gained with ground truth rewards. The reward signal appears almost irrelevant to the outcome.
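A minimal sketch of what these reward variants look like as verifier functions in an RLVR loop. The helper names and the `\boxed{}` answer convention are illustrative assumptions, not the paper's actual code; each function returns a scalar that could feed a standard PPO/GRPO advantage computation:

```python
import random
import re

def extract_boxed_answer(completion: str) -> str | None:
    """Pull the contents of the last \\boxed{...} span, if any (simplified)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def reward_ground_truth(completion: str, answer: str) -> float:
    """Standard RLVR reward: 1.0 only if the extracted answer matches the label."""
    return 1.0 if extract_boxed_answer(completion) == answer else 0.0

def reward_random(completion: str, answer: str) -> float:
    """Spurious reward: a coin flip with no correlation to correctness."""
    return float(random.random() < 0.5)

def reward_format_only(completion: str, answer: str) -> float:
    """Spurious reward: pays out for producing any boxed answer at all."""
    return 1.0 if extract_boxed_answer(completion) is not None else 0.0

def reward_incorrect_label(completion: str, wrong_answer: str) -> float:
    """Spurious reward: rewards agreement with a deliberately wrong label."""
    return 1.0 if extract_boxed_answer(completion) == wrong_answer else 0.0
```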
But these spurious rewards fail entirely for Llama3 and OLMo2 model families. The critical variable is not the reward but the pretraining strategy. Qwen2.5-Math develops a distinctive "code reasoning" behavior — thinking in code without execution — that rises from 66.7% to over 90% frequency after RLVR, even with spurious rewards. Other model families lack this particular latent strategy.
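One way to track that "code reasoning" frequency is a simple textual heuristic over sampled completions. The markers below are an assumption about what counts as thinking in code (code written out but never executed), not the paper's exact detector:

```python
import re

# Rough markers for "thinking in code": code appears in the trace but is never run.
CODE_MARKERS = (
    re.compile(r"`{3}\s*python"),        # fenced Python block inside the trace
    re.compile(r"^\s*def \w+\(", re.M),  # inline function definitions
)

def uses_code_reasoning(completion: str) -> bool:
    """Heuristic flag: does this completion reason via unexecuted code?"""
    return any(pattern.search(completion) for pattern in CODE_MARKERS)

def code_reasoning_rate(completions: list[str]) -> float:
    """Fraction of sampled completions that exhibit code-style reasoning."""
    if not completions:
        return 0.0
    return sum(uses_code_reasoning(c) for c in completions) / len(completions)
```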
This is perhaps the strongest evidence on the question posed in "Does RL teach reasoning or just when to use it?". If random rewards work as well as correct rewards for specific models, then RLVR's function is not to provide direction but to provide pressure. Any optimization signal, correct or not, activates preexisting reasoning strategies encoded during pretraining. The reward is a catalyst, not a teacher.
As argued in "Does training data format shape reasoning strategy more than domain?", the Qwen code-reasoning strategy is a pretraining format artifact. RLVR surfaces it; the specific reward signal is incidental to the surfacing. Models whose pretraining data lacked that format cannot benefit from the same activation pressure.
The practical implication is sobering: RLVR effectiveness may be almost entirely determined before RLVR training begins. The investment in careful reward engineering may be less important than the investment in pretraining data composition.
Critical challenge: data contamination. The RandomCalculation paper directly challenges the "any reward works" interpretation. Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 problems from partial prompts (the first 60% of the problem statement); on post-release LiveMathBench this drops to 0.0%. On a fully clean benchmark of synthetic arithmetic (guaranteed to post-date model release), random rewards produce unstable training with no reliable improvement, while correct rewards deliver consistent gains that surpass the model's initial ceiling. This means the benchmark gains that motivated the "reward doesn't matter" narrative may be substantially inflated by memorization. The code-reasoning behavior change (66.7% → 90%+) is real and not explained by contamination alone, but the headline finding requires significant qualification. See "Does RLVR success on math benchmarks reflect genuine reasoning improvement?" for the full contamination argument and ops/tensions/rlvr-spurious-rewards-work-vs-rlvr-gains-are-data-contamination-artifacts.md for the tension analysis.
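The partial-prompt reconstruction probe can be sketched as follows. The 60% prefix matches the setup described above, but the similarity metric and the 0.9 reconstruction threshold are placeholder assumptions rather than the RandomCalculation paper's exact protocol:

```python
from difflib import SequenceMatcher

def reconstruction_score(generate, problem: str, prefix_frac: float = 0.6) -> float:
    """Show the model the first `prefix_frac` of a benchmark problem and measure
    how closely its continuation matches the held-out remainder.
    `generate` is any prompt -> text callable (e.g. a wrapped inference client)."""
    cut = int(len(problem) * prefix_frac)
    prefix, held_out = problem[:cut], problem[cut:]
    continuation = generate(prefix)[: 2 * len(held_out)]
    return SequenceMatcher(None, held_out, continuation).ratio()

def contamination_rate(generate, problems: list[str], threshold: float = 0.9) -> float:
    """Fraction of problems the model can near-verbatim reconstruct from a prefix."""
    if not problems:
        return 0.0
    scores = [reconstruction_score(generate, p) for p in problems]
    return sum(s >= threshold for s in scores) / len(scores)
```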
Related concepts in this collection
- Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges. Relation: spurious rewards are the strongest confirmation that RL teaches timing, not capability.
- Does training data format shape reasoning strategy more than domain? What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more. Relation: code reasoning as a pretraining format artifact explains the model-specificity.
- Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning. Relation: any reward pressure unlocks latent strategies.
- Do reasoning traces need to be semantically correct? Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning. Relation: parallel finding that corrupted inputs can still yield gains.
Original note title: spurious rewards with no correlation to correct answers still improve rlvr reasoning — but only for models with specific pretraining strategies