Does careful reward engineering matter if pretraining determines RLVR effectiveness?
This explores whether tuning the reward signal in RLVR (reinforcement learning with verifiable rewards) actually changes outcomes, given the corpus finding that a model's pretraining largely sets the ceiling on what RLVR can do.
The corpus gives a consistent answer: pretraining sets the ceiling, but reward design still decides whether you reach it, fall short, or actively break things. Both matter, just at different layers.
Start with the deflationary view. Several notes argue RLVR doesn't teach new reasoning at all; it surfaces capabilities the base model already had. RLVR improves sampling efficiency rather than expanding what's solvable, and base models actually win at high pass@k (Does RLVR actually expand what models can reason about?). The most vivid evidence is that even random or incorrect rewards can lift performance, but only for models whose pretraining left the right latent behavior to activate. Qwen2.5-Math gains 16–25% on MATH-500 from spurious rewards by waking up latent code-reasoning, while Llama and OLMo gain nothing (Why do random rewards improve reasoning for some models but not others?). If a wrong reward works as well as a right one, the reward signal looks almost irrelevant (Why does RLVR work with completely random rewards?; What does reward learning actually do to model reasoning?).
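To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator (draw n samples per problem, count c correct, and compute the chance that a random k-subset contains at least one correct sample). The per-problem counts are invented purely for illustration and come from none of the notes; `mean_pass_at_k`, `BASE_CORRECT`, and `TUNED_CORRECT` are hypothetical names.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given each problem's correct-sample count."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# ASSUMPTION: invented counts for two problems, 256 samples each.
N_SAMPLES = 256
BASE_CORRECT = [40, 2]    # base model: unreliable, but occasionally solves both
TUNED_CORRECT = [200, 0]  # tuned model: reliable on one, never solves the other

for k in (1, 16, 256):
    base = mean_pass_at_k(BASE_CORRECT, N_SAMPLES, k)
    tuned = mean_pass_at_k(TUNED_CORRECT, N_SAMPLES, k)
    print(f"k={k:>3}  base={base:.3f}  tuned={tuned:.3f}")
```

With these made-up counts the tuned model wins at k=1 and the base model wins at k=256, which is the crossover pattern the pass@k note describes: sharper sampling on problems already within reach, with no new problems becoming solvable.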
But here's the turn that makes reward engineering matter again: those 'reward doesn't matter' results mostly hold on contaminated benchmarks, where gains are memorization rather than reasoning. On clean, post-release benchmarks, only genuinely correct rewards help; random and inverse rewards fail or degrade the model (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). Behavioral activation and benchmark improvement turn out to be separable phenomena, so 'spurious rewards work' and 'reward quality matters' can both be true at once depending on what you're measuring (Can genuine reasoning activation coexist with contaminated benchmarks?).
Reward design also determines the damage. Because group-relative normalization over-rewards rare lucky successes, overly hard training samples push models toward degenerate shortcuts (answer repetition, computation-skipping) that contaminate pre-existing skills (Do overly hard RLVR samples actually harm model capabilities?). Binary correctness rewards quietly wreck calibration by rewarding confident guessing, and adding a Brier-score term provably fixes it without trading off accuracy (Does binary reward training hurt model calibration?). And the *shape* of the signal matters: negative reinforcement alone, which just suppresses wrong trajectories, can match or beat full RL while preserving the answer diversity that positive-only training collapses (Does negative reinforcement alone outperform full reinforcement learning?).
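Two of the mechanisms above can be sketched numerically. This is an illustration under stated assumptions, not the notes' actual training code: the advantage is taken to be the reward minus the group mean, divided by the group's population standard deviation (implementations differ on such details), and the calibration-aware reward is assumed to be correctness minus the squared error of the stated confidence, which is one simple way to add a Brier term. All function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # ASSUMPTION: population std; some setups differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def lone_success_advantage(group_size):
    """Advantage given to a single lucky success among group_size samples."""
    rewards = [1.0] + [0.0] * (group_size - 1)
    return group_relative_advantages(rewards)[0]


def binary_reward(correct, confidence):
    """Plain correctness reward: the stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct, confidence):
    """ASSUMED form: correctness minus the squared confidence error."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(reward_fn, p_correct, confidence):
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1 - p_correct) * reward_fn(False, confidence))


def best_confidence(reward_fn, p_correct, steps=100):
    """Grid-search the stated confidence with the highest expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(reward_fn, p_correct, q))


for size in (8, 64):
    print(size, round(lone_success_advantage(size), 3))
print(best_confidence(brier_augmented_reward, 0.3))
```

Under these assumptions a lone success in a group of G samples gets an advantage of about sqrt(G - 1), so the rarer the accidental success, the harder it is reinforced, whatever reasoning produced it. On the calibration side, the binary reward's expected value does not depend on the stated confidence at all, so the reward never penalizes overconfidence, while the Brier-augmented version is maximized when the stated confidence equals the true probability of being correct.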
So the resolution is that pretraining and reward engineering do different jobs. Pretraining decides which formats and behaviors are even available to amplify: RL converges on a single dominant pretraining format within the first epoch and suppresses the rest (Does RL training collapse format diversity in pretrained models?), and it touches only a sparse 5–30% of parameters, in a subnetwork that is nearly identical across random seeds (Does reinforcement learning update only a small fraction of parameters?). Reward engineering decides whether that narrow, powerful intervention lands on real reasoning or on shortcuts, whether calibration survives, and whether diversity is preserved. The unexpected payoff for the curious reader: the interesting design frontier may not be the reward value at all but *which trajectories you process and how*. Treating successes as demonstrations and failures as abstracted lessons outperforms uniform reward application (Should successful and failed episodes be processed differently?). Pretraining loads the gun; reward engineering decides where it points.
Sources (12 notes)
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand the set of solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Qwen2.5-Math improves 16–25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.