INQUIRING LINE

Why do spurious reward signals improve reasoning for some pretrained models?

This explores why random or incorrect reward signals can still boost reasoning in certain pretrained models — and why the same trick does nothing for others.


The short answer the corpus keeps circling back to: the reward isn't teaching anything new. It's flipping a switch on behavior the model already learned during pretraining. When researchers gave Qwen2.5-Math rewards with zero correlation to correct answers, even random ones, it gained 16-25% on MATH-500, while Llama and OLMo gained nothing (Why do random rewards improve reasoning for some models but not others?). The difference wasn't the reward; it was that Qwen had absorbed latent code-reasoning patterns during pretraining that the optimization pressure could surface. The pretraining format determines what's there to be activated.
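To make "zero correlation to correct answers" concrete, here is a minimal sketch of a random reward. Everything in it is an illustrative assumption rather than the setup used in the cited work: the function names, the 0.5 reward probability, and the simulated 40% base accuracy are all hypothetical. It only shows that such a reward carries no information about correctness; it does not model the training dynamics.

```python
import random
from statistics import mean


def spurious_reward(completion: str, rng: random.Random, p: float = 0.5) -> float:
    """Hypothetical reward: 1.0 with probability p, ignoring the completion."""
    return 1.0 if rng.random() < p else 0.0


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n: int = 10_000, accuracy: float = 0.4, seed: int = 0):
    """Simulate n rollouts: correctness and reward are drawn independently."""
    answer_rng = random.Random(seed)      # stands in for the model's answers
    reward_rng = random.Random(seed + 1)  # separate stream for the reward
    correct = [1.0 if answer_rng.random() < accuracy else 0.0 for _ in range(n)]
    rewards = [spurious_reward("", reward_rng) for _ in range(n)]
    return correct, rewards


if __name__ == "__main__":
    correct, rewards = simulate()
    print("correlation(reward, correctness):", pearson(correct, rewards))
```

Because the reward never reads the completion, any gain it produces has to come from what the optimizer amplifies in the model, not from information in the signal.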

This reframes what reinforcement learning is actually doing. The dominant story is that RLVR (reinforcement learning with verifiable rewards) elicits rather than creates: it improves how efficiently a model samples from strategies it already has, without pushing past its capability boundary, and a single training example can be enough to trigger the activation (What does reward learning actually do to model reasoning?). That fits a broader finding from five independent lines of work: RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, and RLVR all unlock reasoning already sitting in base-model activations (Do base models already contain hidden reasoning ability?). Post-training selects; it doesn't build. So a spurious reward works for the same reason a correct one does: both are just nudges that bias sampling toward latent good behavior, and if that behavior exists, even a noisy nudge finds it.
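One way to see the difference between sampling efficiency and capability boundary is through pass@k. The sketch below uses the standard unbiased pass@k estimator from code-generation evaluation, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn and c the number that are correct. The sample counts in the usage example are made up for illustration and are not taken from the notes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Equals the probability that at least one of k samples drawn without
    replacement from the n is correct: 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: the tuned model solves the problem far more often
    # per sample, but both models can solve it given enough attempts.
    base_correct, tuned_correct, n = 5, 30, 100
    for k in (1, 8, 64):
        print(k, pass_at_k(n, base_correct, k), pass_at_k(n, tuned_correct, k))
```

With these illustrative counts the gap is large at k=1 and nearly vanishes at k=64, which is the pattern the elicitation story predicts: the tuned model samples the right strategy more often, but the base model could already reach it.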

That also explains the asymmetry you'd otherwise find baffling. If the model has no latent reasoning strategy to surface, there's nothing for the reward — correct or spurious — to amplify, which is why Llama and OLMo flatline. The capability ceiling is set in pretraining; reward signal quality mostly governs whether you reach it, not how high it is.

There's a subtler mechanism worth knowing here. Part of why even uninformative rewards help may be that the positive signal isn't what's doing the work. Training on only negative samples, which suppresses wrong trajectories, often matches full RL (PPO and GRPO) on Pass@k, because it preserves answer diversity while positive-only reinforcement collapses probability mass onto a few paths (Does negative reinforcement alone outperform full reinforcement learning?). If a chunk of the benefit comes from pruning bad paths rather than rewarding good ones, you'd expect rewards loosely tied (or untied) to correctness to still do something useful.

The honest caveat: this is a property of the model, not a free lunch. The same literature warns that reward quality matters enormously once you care about more than benchmark accuracy. Binary correctness rewards quietly wreck calibration by encouraging confident guessing (Does binary reward training hurt model calibration?), and standard training can't tell causal quality signals from spurious correlated ones unless you force the distinction (Can counterfactual invariance eliminate reward hacking biases?). So spurious rewards 'working' is really a diagnostic: it tells you the reasoning was pretrained in and the reward is just an activation key, which is a very different thing from the reward being good.
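The calibration point can be checked with a few lines of arithmetic. The sketch below compares a binary correctness reward with one that subtracts a Brier-style penalty on the model's stated confidence. The exact way the two terms are combined here (correctness minus squared error of the confidence) is an assumption made for illustration, as are all function names; the calibration note only says that a Brier score is added as a second reward term.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Correctness-only reward: the stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Assumed combination: correctness minus squared confidence error."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


def best_confidence(reward_fn, p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(reward_fn, p_correct, q))


if __name__ == "__main__":
    p = 0.3  # hypothetical true chance that the answer is correct
    print("binary, confidence 0.0:", expected_reward(binary_reward, p, 0.0))
    print("binary, confidence 1.0:", expected_reward(binary_reward, p, 1.0))
    print("best confidence with Brier term:",
          best_confidence(brier_augmented_reward, p))
```

Under the binary reward, expected reward is the same at every confidence level, so nothing discourages claiming certainty. With the squared-error term, expected reward is p - p(1-q)^2 - (1-p)q^2, which is maximized at q = p, so honest confidence becomes the best policy.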


Sources (6 notes)

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16-25% MATH-500 improvement from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.
