Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR shows both real behavioral changes and inflated metrics. Can these seemingly contradictory findings describe the same phenomenon from different angles, and what does that mean for evaluating reasoning improvements?

Note · 2026-02-23 · sourced from Flaws

Two RLVR findings appeared contradictory:

Spurious rewards work: As discussed in "Why do random rewards improve reasoning for some models but not others?", random rewards can still improve reasoning for some models. This suggests that the reward signal itself matters less than the RL training process, which activates latent pretraining capabilities. It was treated as evidence that RLVR functions as a pretraining catalyst rather than a reasoning teacher.

Benchmark contamination: As discussed in "Does RLVR success on math benchmarks reflect genuine reasoning improvement?", the metric improvement may reflect memorized benchmark data rather than genuine reasoning activation.

The resolution: These findings operate at different measurement levels and can coexist:

Behavioral activation (genuine): RL training with any reward signal activates code reasoning formats and structured thinking patterns that are present in the pretraining data but lie dormant. This is visible in changes to output format, thinking token usage, and exploration behavior, all of which are measurements not contaminated by benchmark overlap.
Benchmark improvement (inflated): The metric improvement on contaminated benchmarks is partially or fully attributable to memorization. On clean benchmarks, the gains from spurious rewards shrink or disappear, while correct rewards still produce improvement.
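
The benchmark-level comparison can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the cited notes: the `Accuracy` record, the `clean_gain_share` helper, and every number below are hypothetical placeholders, and the ratio is only a crude heuristic that assumes the contaminated and clean benchmarks are of comparable difficulty.

```python
from dataclasses import dataclass


@dataclass
class Accuracy:
    """Accuracy (0..1) before and after RL training on one benchmark."""
    before: float
    after: float

    @property
    def gain(self) -> float:
        return self.after - self.before


def clean_gain_share(contaminated: Accuracy, clean: Accuracy) -> float:
    """Share of the contaminated-benchmark gain that also appears on a clean benchmark.

    Near 0.0: the gain vanishes on clean data (consistent with memorization).
    Near 1.0: the gain transfers (consistent with genuine improvement).
    """
    if contaminated.gain <= 0:
        raise ValueError("no gain on the contaminated benchmark to explain")
    return max(0.0, min(1.0, clean.gain / contaminated.gain))


# Hypothetical placeholder numbers, NOT measured results.
spurious = clean_gain_share(Accuracy(0.5, 0.75), Accuracy(0.5, 0.5))
correct = clean_gain_share(Accuracy(0.5, 0.75), Accuracy(0.5, 0.625))
print(f"spurious reward: {spurious:.2f} of the gain survives on clean data")
print(f"correct reward:  {correct:.2f} of the gain survives on clean data")
```

Under these made-up inputs, the spurious-reward run keeps none of its gain on the clean benchmark while the correct-reward run keeps half, which is the qualitative pattern described above.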

The practical implication: RLVR research must separate behavioral measurements (how the model's reasoning process changes) from performance measurements (how benchmark scores change). Both are informative; conflating them produces confusion about what RLVR actually does. The one-shot activation finding (a single training example raising the reported score from 36% to 73.6%) may itself need re-evaluation on clean benchmarks.
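
One way to keep the two kinds of measurement apart is to compute them with separate functions, so that only the performance metric ever touches an answer key. The sketch below is an assumption-laden illustration: `behavioral_profile`, `accuracy`, and the `CODE_PATTERN` heuristic for spotting code-style reasoning are hypothetical names and choices, not tooling from the cited work, and exact string match is a stand-in for a real answer checker.

```python
import re
from statistics import mean

# Heuristic (an assumption): a fenced block or a line starting with
# def/import/print( counts as code-style reasoning.
CODE_PATTERN = re.compile(r"```|^\s*(def |import |print\()", re.MULTILINE)


def behavioral_profile(outputs: list[str]) -> dict[str, float]:
    """Behavioral measurements: computed from raw outputs, no answer key needed."""
    if not outputs:
        raise ValueError("need at least one model output")
    return {
        "code_format_rate": float(mean(1.0 if CODE_PATTERN.search(o) else 0.0 for o in outputs)),
        "mean_length_words": float(mean(len(o.split()) for o in outputs)),
    }


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Performance measurement: depends on the answer key, hence on contamination."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return float(mean(1.0 if p.strip() == a.strip() else 0.0 for p, a in zip(predictions, answers)))


# Toy inputs for illustration only.
outputs = ["Let me compute.\n```python\nprint(2+2)\n```", "The answer is 4."]
print(behavioral_profile(outputs))
print(accuracy(["4", " 5 "], ["4", "6"]))
```

Reporting the two results side by side, before and after training, makes it harder to read a format shift as a score gain or the reverse.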

Original note title

RLVR behavioral activation and benchmark improvement are separable — genuine pretraining activation can coexist with contamination-inflated metrics