Why do different models respond differently to spurious rewards?

This explores why a training trick that seems like it shouldn't work — rewarding a model with signals that have nothing to do with correct answers — boosts reasoning in some models but does nothing for others.

The corpus's sharpest answer is that the reward isn't teaching the model anything new; it's switching on behavior the model already learned during pretraining. The clearest demonstration: Qwen2.5-Math improves 16–25% on MATH-500 when trained with random or even *incorrect* rewards, while Llama and OLMo show no gains at all (see "Why do random rewards improve reasoning for some models but not others?"). The reward cannot explain the difference, since every model receives the same noise; what differs is what each model was pretrained on. Qwen had latent code-reasoning patterns waiting to be surfaced; the others had nothing comparable to surface.
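To make "spurious reward" concrete, here is a minimal sketch of three reward functions that an RL loop with verifiable rewards might call. The function names, the exact-match check, and the coin-flip probability of 0.5 are illustrative assumptions, not details taken from the underlying studies.

```python
import random

def correct_reward(answer: str, gold: str) -> float:
    """Ground-truth reward: 1.0 only when the answer matches the reference."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def random_reward(answer: str, gold: str, rng: random.Random, p: float = 0.5) -> float:
    """Spurious reward: a coin flip that never inspects the answer."""
    return 1.0 if rng.random() < p else 0.0

def incorrect_reward(answer: str, gold: str) -> float:
    """Spurious reward: pays out only when the answer is wrong."""
    return 1.0 - correct_reward(answer, gold)

# Hypothetical (answer, reference) pairs for illustration.
rng = random.Random(0)
for answer, gold in [("42", "42"), ("41", "42")]:
    print(answer,
          correct_reward(answer, gold),
          incorrect_reward(answer, gold),
          random_reward(answer, gold, rng))
```

The point of the sketch is that `random_reward` and `incorrect_reward` carry no usable information about which answers are right, so any gain from training on them has to come from something the model already had.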

This reframes what reinforcement learning is doing in the first place. Rather than expanding a model's reasoning ability, reward learning mostly *activates* strategies already present and improves how efficiently the model samples them, staying inside the capability boundary set by pretraining rather than pushing past it (see "What does reward learning actually do to model reasoning?"). That's why a single training example, or a meaningless reward, can be nearly as effective as a carefully correct one, at least for models with the right pretraining: the heavy lifting was done before RL ever started. The reward is a wake-up call, not a curriculum. So 'why do models respond differently' becomes 'what did each model already know how to do, latently, before you applied pressure?'
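One way to picture "better sampling efficiency inside a fixed capability boundary" is with pass@k. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k); the per-problem correct counts are made-up numbers chosen only to illustrate the pattern, not measurements from any model.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical correct counts out of n = 100 samples on three problems.
n = 100
base = [5, 2, 0]        # two problems solvable but rarely sampled; one out of reach
after_rl = [60, 40, 0]  # existing solutions become likelier; the third is unchanged

def mean_pass(counts, k):
    """Average pass@k across problems."""
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)

for k in (1, 64):
    print(k, round(mean_pass(base, k), 3), round(mean_pass(after_rl, k), 3))
```

Under these assumed counts, pass@1 rises sharply after training, while pass@64 stays capped at two thirds for both models because the third problem is never solved by either. That is the shape the activation view predicts: the model samples what it already knows more reliably, without the set of solvable problems growing.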

There's a subtler layer here worth knowing. We tend to assume spurious signals are noise a good model should *ignore*, which is the shortcut-learning view. But in some reasoning tasks, specifically heuristic override tasks, the opposite holds: stripping out spurious cues actually *hurts* performance, because the real challenge is integrating conflicting signals into a coherent answer rather than filtering distractors out (see "Why does removing spurious cues sometimes hurt model performance?"). That suggests a model's relationship to 'spurious' information is bound up with how it composes cues, not just whether it can screen them, which is another axis along which models trained differently will diverge.

The broader lesson the corpus keeps circling is that the reward signal carries far less causal information than its effect on training implies. Standard reward models can't even distinguish causal quality features from spurious ones, picking up length, sycophancy, and other phantom signals unless explicitly constrained (see "Can counterfactual invariance eliminate reward hacking biases?"). Reward scores also barely move when you swap the prompt but keep the response, showing they often grade against signals only loosely tied to the actual task (see "Why do reward models ignore what question was asked?"). If the reward channel is this lossy and this easy to fool, then the outcome of training is determined largely by what the model brings to it, which is exactly why two models fed the same spurious rewards walk away looking nothing alike.
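The prompt-swap observation can be turned into a simple probe. The sketch below is a hypothetical harness, not the procedure from the cited work: `prompt_swap_gap` and both toy scorers are invented for illustration, and a real check would pass in an actual reward model as the `score` callable.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]

def prompt_swap_gap(score: Scorer, pairs: Sequence[tuple[str, str]]) -> float:
    """Mean score drop when each response is paired with a mismatched prompt."""
    n = len(pairs)
    gaps = []
    for i, (prompt, response) in enumerate(pairs):
        wrong_prompt = pairs[(i + 1) % n][0]  # rotate prompts by one position
        gaps.append(score(prompt, response) - score(wrong_prompt, response))
    return sum(gaps) / n

def length_scorer(prompt: str, response: str) -> float:
    """Toy stand-in for a biased reward model: longer responses score higher."""
    return float(len(response.split()))

def overlap_scorer(prompt: str, response: str) -> float:
    """Toy stand-in for a prompt-aware scorer: counts words shared with the prompt."""
    return float(len(set(prompt.lower().split()) & set(response.lower().split())))

pairs = [
    ("what is the capital of france", "the capital of france is paris"),
    ("how many legs does a spider have", "a spider has eight legs"),
]
print(prompt_swap_gap(length_scorer, pairs))   # 0.0: the score never reads the prompt
print(prompt_swap_gap(overlap_scorer, pairs))  # positive: mismatched prompts score lower
```

A gap near zero means the scorer is grading the response in isolation, which is the failure the paragraph above describes; a clearly positive gap means the prompt actually enters the score.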

The thing you didn't know you wanted to know: the same noise that does nothing to one model can unlock double-digit gains in another, and the deciding factor lives entirely in pretraining, meaning a 'reward' in RL is often closer to a key than a teacher. If you want to keep pulling this thread, the negative-reinforcement work showing that training *only* on what's wrong can often match full RL (see "Does negative reinforcement alone outperform full reinforcement learning?") is a good next stop: it pushes further on the idea that the informative part of a reward may not be the part you'd expect.
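For intuition about what "training only on what's wrong" does to a policy, here is a toy single-step update on a three-way softmax. It is a plain REINFORCE-style step with reward -1 for a wrong sample and no update for a correct one; the function names, logits, and learning rate are assumptions for illustration, and this is not the implementation from the negative-reinforcement work.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def negative_only_step(logits, sampled, is_correct, lr=1.0):
    """One REINFORCE-style update that only penalizes wrong samples.

    Correct samples get reward 0 and leave the logits untouched; wrong
    samples get reward -1, which lowers their logit and raises the others.
    """
    if is_correct:
        return list(logits)
    probs = softmax(logits)
    reward = -1.0
    # Gradient of log pi(sampled) w.r.t. logit j is 1[j == sampled] - probs[j].
    return [
        x + lr * reward * ((1.0 if j == sampled else 0.0) - probs[j])
        for j, x in enumerate(logits)
    ]

logits = [1.0, 0.5, 0.0]  # hypothetical policy over three candidate answers
before = softmax(logits)
after = softmax(negative_only_step(logits, sampled=0, is_correct=False))
print([round(p, 3) for p in before])
print([round(p, 3) for p in after])
```

In this toy run the penalized answer loses probability, and both remaining answers gain while keeping their relative order. The update never names a preferred answer; it only pushes mass away from the sampled mistake and lets the model's existing preferences decide where that mass goes.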

Sources (6 notes)

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16–25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.

Why do reward models ignore what question was asked?

When prompts are swapped while keeping responses identical, reward model scores barely change. This reveals that standard RLHF optimizes against phantom quality signals rather than prompt-response alignment, enabling four distinct biases.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining why different LLMs respond differently to spurious rewards in reasoning tasks—a question that a curated library (2024–2026) has begun to answer, but capabilities and training methods move fast.

What a curated library found — and when (dated claims, not current truth):
• Qwen2.5-Math gains 16–25% on MATH-500 from random or incorrect rewards; Llama and OLMo show no gain from identical noise (2024–2025).
• RL activation hypothesis: rewards don't expand reasoning ability; they surface latent strategies already present in pretraining, staying inside the capability boundary (2025).
• Standard reward models conflate causal quality signals with spurious proxies (length, sycophancy); reward scores barely move when prompts swap but responses remain identical (2024–2025).
• Negative reinforcement alone (training only on what's wrong) matches or exceeds full RL, suggesting the informative part of a reward may not be classical feedback (2025–2026).
• Models trained differently compose conflicting signals in fundamentally different ways; 'spurious' information integrates rather than corrupts in some reasoning domains (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2409.15360 (2024-09) — Reward-Robust RLHF in LLMs
• arXiv:2501.09620 (2025-01) — Causal Rewards for LLM Alignment
• arXiv:2506.01347 (2025-06) — Negative Reinforcement in LLM Reasoning
• arXiv:2603.29025 (2026-03) — Surface Heuristics Override in LLM Reasoning

Your task:
(1) RE-TEST the activation hypothesis. Have newer training methods (e.g., scaling, constitutional AI, in-context adaptation, multi-step reasoning harnesses) changed whether spurious rewards surface latent vs. learn genuinely new reasoning? Does the Qwen–Llama divergence still hold with current model families (Qwen2.5, Llama3.2, others)? Separate what was pretraining-specific from what might now be learnable in-RL.
(2) Surface the strongest contradiction from the last ~6 months: does any work show that reward fidelity *does* matter now, or that spurious signals can be learned rather than merely activated? Identify where the synthesis may have broken down.
(3) Propose two questions that assume the regime has shifted: (a) If models now learn rather than activate under RL, how do reward-model quality and pretraining interact differently? (b) Can we design spurious rewards that are *orthogonal* to pretraining to test whether activation dominates learning empirically?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
