Can clean benchmarks reveal true RLVR reasoning gains?
This explores whether clean, post-release test sets, which strip out benchmark contamination, let us see RLVR's real reasoning gains, and what the corpus says is actually moving when scores rise.
The answer the corpus gives is more interesting than yes or no: clean benchmarks expose what's *not* reasoning, but they don't show RLVR adding new reasoning either. The clearest demonstration is the contamination test itself. Qwen2.5-Math-7B can reconstruct over half of MATH-500 (54.6%) from partial prompts, yet it scores 0.0% on the post-release benchmark LiveMathBench. Much of the apparent 'gain' on dirty benchmarks is therefore memorization, and on clean ones only genuinely correct rewards help, while random or inverted rewards stall or degrade performance (see: Does RLVR success on math benchmarks reflect genuine reasoning improvement?). So clean benchmarks do their first job well: they separate recall from reasoning.
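The partial-prompt contamination test described above can be sketched roughly as follows. This is a minimal illustration, not the procedure behind the reported numbers: the 60% word-level split, the prefix-match criterion, and the `complete` callable (a stand-in for whatever model call you use) are all assumptions introduced here.

```python
from typing import Callable, Iterable, Tuple


def split_prompt(problem: str, keep_fraction: float = 0.6) -> Tuple[str, str]:
    """Split a problem into a visible prefix and a held-out suffix, by words."""
    words = problem.split()
    cut = max(1, int(len(words) * keep_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],  # hypothetical model call: prefix -> continuation
    keep_fraction: float = 0.6,
) -> float:
    """Fraction of problems whose held-out suffix the model reproduces verbatim."""
    hits = 0
    total = 0
    for problem in problems:
        prefix, suffix = split_prompt(problem, keep_fraction)
        if not suffix:
            continue  # nothing was held out, so the problem cannot be scored
        total += 1
        # Normalize whitespace so formatting differences do not hide a match.
        continuation = " ".join(complete(prefix).split())
        if continuation.startswith(suffix):
            hits += 1
    return hits / total if total else 0.0
```

A high rate on an older benchmark combined with a near-zero rate on a post-release one is the pattern the note treats as evidence of memorization. Real studies may use softer overlap metrics than exact prefix matching.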
But here's the twist the corpus insists on: behavioral activation and benchmark score are *different measurement levels* that can both be real at once. RLVR can switch on genuine reasoning patterns through training even while a contaminated score is rising for unrelated memorization reasons. The two don't contradict each other, which is exactly why a single number can't adjudicate the question (see: Can genuine reasoning activation coexist with contaminated benchmarks?). A clean benchmark removes the memorization confound, but it still only measures pass/fail outcomes, not whether the reasoning *trace* is sound.
And that's where even clean accuracy misleads. RLVR measurably tightens local coherence, with fewer logical jumps between adjacent steps, without guaranteeing that the proof is globally valid. You can get a locally tidy, globally wrong argument that still lands the right answer (see: Does RLVR actually improve mathematical reasoning or just coherence?). Several notes converge on a deflationary reading of what the 'gain' even is. RLVR mostly improves *sampling efficiency*: it concentrates probability on solutions already inside the base model's distribution rather than expanding what's solvable. Pass@k analysis shows base models overtaking RLVR models at high k (see: Does RLVR actually expand what models can reason about?), and the boundary can actively *collapse* as exploitation crowds out exploration (see: Why does RLVR training narrow a model's problem solving ability?). The reason spurious rewards 'work' at all is that they surface latent pretraining behaviors. Qwen gains from random rewards while Llama and OLMo don't, meaning the score reflects what pretraining loaded in, not what RL taught (see: Why do random rewards improve reasoning for some models but not others?; What does reward learning actually do to model reasoning?).
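The pass@k comparison can be made concrete with the standard unbiased estimator: for a problem with n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below applies it to two sets of per-problem correct counts. Those counts are invented purely to illustrate the crossover shape and are not results from any of the notes.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so any draw of k contains a success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over a benchmark, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts out of n = 100 samples on four problems.
N_SAMPLES = 100
BASE_COUNTS = [2, 2, 2, 2]    # rarely right, but on every problem
RLVR_COUNTS = [60, 60, 0, 0]  # reliably right, but on fewer problems

for k in (1, 8, 64):
    base = mean_pass_at_k(BASE_COUNTS, N_SAMPLES, k)
    rlvr = mean_pass_at_k(RLVR_COUNTS, N_SAMPLES, k)
    print(f"k={k:>2}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these made-up counts the RLVR-style model wins at k = 1 and the base-style model wins at k = 64. That is the signature the notes read as sharper sampling inside an unchanged (or narrower) capability boundary.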
So the honest answer: a clean benchmark is necessary but not sufficient. It defeats the memorization illusion, but 'true reasoning gain' needs richer instruments than final-answer accuracy. Those include trace-validity checks, step-level confidence that catches breakdowns global averaging hides (see: Does step-level confidence outperform global averaging for trace filtering?), and pass@k curves that ask whether the *capability boundary* moved rather than whether sampling got luckier. Worth knowing: the same training that looks like a gain can quietly amplify shortcuts. Train on near-impossible problems and the model learns answer-repetition and computation-skipping that contaminate skills it already had (see: Do overly hard RLVR samples actually harm model capabilities?). The benchmark stays clean; the reasoning doesn't.
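To see why step-level confidence can catch what a global average hides, consider the sketch below. It compares a trace's mean confidence with its lowest sliding-window mean. The window size, the threshold, and the confidence values are illustrative assumptions, not parameters taken from the cited note.

```python
from typing import Optional, Sequence


def global_confidence(conf: Sequence[float]) -> float:
    """Mean confidence over the whole trace."""
    return sum(conf) / len(conf)


def window_means(conf: Sequence[float], window: int = 4) -> list:
    """Mean confidence of every contiguous window of the given size."""
    if len(conf) <= window:
        return [global_confidence(conf)]
    return [
        sum(conf[i:i + window]) / window
        for i in range(len(conf) - window + 1)
    ]


def lowest_window_confidence(conf: Sequence[float], window: int = 4) -> float:
    """Worst local stretch of the trace."""
    return min(window_means(conf, window))


def first_breakdown(
    conf: Sequence[float], window: int = 4, threshold: float = 0.5
) -> Optional[int]:
    """Start index of the first window below threshold, or None if none is."""
    for i, mean in enumerate(window_means(conf, window)):
        if mean < threshold:
            return i  # a generator could stop sampling this trace here
    return None


# Hypothetical per-step confidences: one steady trace, one with a local collapse.
STEADY = [0.8] * 20
BROKEN = [0.9] * 8 + [0.2] * 4 + [0.9] * 8
```

Here `STEADY` and `BROKEN` have nearly the same global mean (0.80 versus 0.76), but the lowest window of `BROKEN` drops to about 0.2. `first_breakdown` flags that trace partway through, which is the sense in which local confidence allows early stopping.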
Sources (9 notes)
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.
Qwen2.5-Math gains 16-25% on MATH-500 from random or incorrect rewards, because those rewards activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
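The group-relative normalization mentioned in the last note can be illustrated with a small sketch. It assumes a GRPO-style advantage, (reward - group mean) / group standard deviation, with binary rewards. The group size of 16 and the `eps` stabilizer are assumptions made for illustration.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Near-impossible problem: 1 accidental success in 16 rollouts.
RARE = [1.0] + [0.0] * 15
# Moderately hard problem: 8 successes in 16 rollouts.
BALANCED = [1.0] * 8 + [0.0] * 8

rare_adv = group_relative_advantages(RARE)
balanced_adv = group_relative_advantages(BALANCED)
print(f"advantage of the lone success:     {rare_adv[0]:.2f}")
print(f"advantage of a success at 50% rate: {balanced_adv[0]:.2f}")
```

Under these assumptions the single lucky rollout receives an advantage close to the square root of 15, several times larger than a success on the balanced problem. Whatever shortcut produced that lucky answer is reinforced correspondingly hard.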