Can clean benchmarks reveal true RLVR reasoning gains?

This explores whether removing benchmark contamination, by evaluating on clean, post-release test sets, lets us see RLVR's real reasoning gains, and what the corpus says is actually moving when scores improve.

The answer the corpus gives is more interesting than yes or no: clean benchmarks expose what is *not* reasoning, but they don't show RLVR adding new reasoning either. The clearest demonstration is the contamination test itself. Qwen2.5-Math-7B can reconstruct over half of MATH-500 from partial prompts (54.6%) yet scores essentially zero on LiveMathBench, a benchmark released after its training cutoff. Much of the apparent 'gain' on dirty benchmarks is therefore memorization, and on clean ones only genuinely correct rewards help, while random or inverted rewards stall or degrade performance (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). So clean benchmarks do their first job well: they separate recall from reasoning.
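The partial-prompt test described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the 60% prefix split, the exact-match criterion, and the names `reconstruction_rate`, `toy_complete`, and `MEMORIZED` are all hypothetical and are not taken from the notes, and the "model" is a toy lookup rather than a real LLM.

```python
from typing import Callable, Iterable


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],
    prefix_fraction: float = 0.6,  # assumed split, not taken from the notes
) -> float:
    """Fraction of problems whose held-out tail the model reproduces verbatim."""
    hits = 0
    total = 0
    for text in problems:
        words = text.split()
        cut = int(len(words) * prefix_fraction)
        if cut == 0 or cut == len(words):
            continue  # too short to split into a prompt and a tail
        prefix = " ".join(words[:cut])
        tail = " ".join(words[cut:])
        completion = " ".join(complete(prefix).split())  # normalize whitespace
        hits += completion.startswith(tail)
        total += 1
    return hits / total if total else 0.0


# Toy stand-in for a model: it "remembers" exactly one problem statement.
MEMORIZED = ["Find the sum of all positive integers n such that n divides 12"]


def toy_complete(prefix: str) -> str:
    """Return the memorized continuation if the prefix was seen, else nothing."""
    for text in MEMORIZED:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ""


seen = MEMORIZED
unseen = ["Compute the number of lattice paths that avoid the marked diagonal cells"]
print(reconstruction_rate(seen, toy_complete))
print(reconstruction_rate(unseen, toy_complete))
```

With the toy lookup, the seen problem is fully reconstructed and the unseen one is not, which is the same qualitative gap the contamination test looks for between an old benchmark and a post-release one.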

But here's the twist the corpus insists on: behavioral activation and benchmark score are *different measurement levels* that can both be real at once. RLVR can switch on genuine reasoning patterns through training even while a contaminated score is rising for unrelated memorization reasons. The two don't contradict each other, which is exactly why a single number can't adjudicate the question (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). A clean benchmark removes the memorization confound, but it still measures only pass/fail outcomes, not whether the reasoning *trace* is sound.

And that's where even clean accuracy misleads. RLVR measurably tightens local coherence, with fewer logical jumps between adjacent steps, without guaranteeing that the proof is globally valid; you can get a locally tidy, globally wrong argument that still lands the right answer (see "Does RLVR actually improve mathematical reasoning or just coherence?"). Several notes converge on a deflationary reading of what the 'gain' even is. RLVR mostly improves *sampling efficiency*, concentrating probability on solutions already inside the base model's distribution rather than expanding what is solvable: pass@k analysis shows base models overtaking RLVR models at high k (see "Does RLVR actually expand what models can reason about?"), and the capability boundary can actively *collapse* as exploitation crowds out exploration (see "Why does RLVR training narrow a model's problem solving ability?"). Spurious rewards 'work' at all because they surface latent pretraining behaviors: Qwen gains from random rewards, Llama and OLMo don't. The score therefore reflects what pretraining loaded in, not what RL taught (see "Why do random rewards improve reasoning for some models but not others?" and "What does reward learning actually do to model reasoning?").
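To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem correct counts below are invented purely for illustration and are not results from any of the notes; `benchmark_pass_at_k`, `base_correct`, and `rlvr_correct` are hypothetical names.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct counts out of n samples each."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 256
# Invented counts: the "base" model solves every problem, but rarely;
# the "RLVR" model solves two problems reliably and two never.
base_correct = [64, 8, 2, 1]
rlvr_correct = [230, 120, 0, 0]

for k in (1, 16, 256):
    base = benchmark_pass_at_k(base_correct, N, k)
    rlvr = benchmark_pass_at_k(rlvr_correct, N, k)
    print(f"k={k:>3}  base={base:.3f}  rlvr={rlvr:.3f}")
```

In this toy setup the RLVR-style model wins at k=1 because its probability mass is concentrated, while the base-style model wins at k=256 because it covers more problems. That crossover is the shape the notes describe; whether it appears for a given real model is an empirical question.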

So the honest answer: a clean benchmark is necessary but not sufficient. It defeats the memorization illusion, but 'true reasoning gain' needs richer instruments than final-answer accuracy: trace-validity checks, step-level confidence that catches breakdowns global averaging hides (see "Does step-level confidence outperform global averaging for trace filtering?"), and pass@k curves that ask whether the *capability boundary* moved rather than whether sampling got luckier. Worth knowing: the same training that looks like a gain can quietly amplify shortcuts. Train on near-impossible problems and the model learns answer repetition and computation-skipping that contaminate skills it already had (see "Do overly hard RLVR samples actually harm model capabilities?"). The benchmark stays clean; the reasoning doesn't.
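The contrast between step-level confidence and global averaging can be sketched in a few lines. This is an illustrative simplification, not the method of any cited paper: it assumes each reasoning step already carries a confidence score in [0, 1], and the window size, the 0.7 threshold, and the example scores are made up.

```python
from statistics import mean


def global_confidence(step_conf: list[float]) -> float:
    """Average confidence over the whole trace."""
    return mean(step_conf)


def lowest_window_confidence(step_conf: list[float], window: int = 2) -> float:
    """Lowest average confidence over any run of `window` consecutive steps."""
    if len(step_conf) < window:
        return mean(step_conf)
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )


THRESHOLD = 0.7  # hypothetical cut-off for keeping a trace

steady = [0.8] * 8
broken = [0.95, 0.95, 0.95, 0.20, 0.25, 0.95, 0.95, 0.95]  # local breakdown mid-trace

for name, trace in (("steady", steady), ("broken", broken)):
    g = global_confidence(trace)
    w = lowest_window_confidence(trace)
    print(f"{name}: global={g:.3f} keep={g >= THRESHOLD} | local={w:.3f} keep={w >= THRESHOLD}")
```

The broken trace passes the global filter because its confident steps outweigh the two weak ones, while the windowed score isolates the breakdown. A running version of the windowed score is also what would allow stopping a trace early, since it does not need the full trace to be finished.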

Sources (9 notes)

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16-25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
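The group-relative normalization effect in the last note can be illustrated numerically. The sketch below assumes the common GRPO-style form (reward minus group mean, divided by group standard deviation); the use of population standard deviation, the `eps` term, the group size of 16, and the function name are my assumptions, not details from the note.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Near-impossible problem: 1 accidental success in 16 rollouts.
rare = [1.0] + [0.0] * 15
# Moderately hard problem: 8 successes in 16 rollouts.
balanced = [1.0] * 8 + [0.0] * 8

rare_adv = group_relative_advantages(rare)
balanced_adv = group_relative_advantages(balanced)

print(f"success advantage, rare group:     {rare_adv[0]:.2f}")
print(f"success advantage, balanced group: {balanced_adv[0]:.2f}")
```

Under these assumptions the lone success in the rare group receives an advantage several times larger than a success in the balanced group. If that single success came from a shortcut such as repeating an answer, the update pushes hard toward the shortcut, which is the mechanism the note describes.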

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-evaluating a question about clean benchmarks in RLVR (Reinforcement Learning with Verifiable Rewards): Can they reveal true reasoning gains, or only surface artifacts?

What a curated library found — and when (2025–2026, dated claims not current truth):
• Clean benchmarks defeat memorization confounds (Qwen2.5-Math-7B reconstructs >50% of MATH-500 from partial prompts but scores ~0 on held-out data), but only measure pass/fail outcomes, not reasoning trace validity (~2025).
• RLVR improves local step coherence without guaranteeing global proof validity; models produce locally tidy, globally wrong arguments that still land correct answers (~2025).
• RLVR concentrates probability on solutions within the base model's distribution rather than expanding capability; pass@k analysis shows base models overtaking RLVR models at high k, and capability boundaries can collapse as exploitation crowds out exploration (~2025).
• Spurious (random/inverted) rewards improve RLVR scores in some models (Qwen) but not others (Llama, OLMo), reflecting pretraining composition rather than learned reasoning (~2025).
• Overly hard RLVR samples induce degenerate behaviors (answer repetition, computation-skipping) that amplify shortcuts in already-learned skills (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2507.10532 — Reasoning or Memorization (2025-07)
• arXiv:2504.13837 — Does RL Really Incentivize Reasoning Capacity (2025-04)
• arXiv:2510.18176 — Local Coherence or Global Validity (2025-10)
• arXiv:2508.15260 — Deep Think with Confidence (2025-08)

Your task:
(1) RE-TEST EACH CONSTRAINT. For memorization confounds, capability saturation, trace validity gaps, and spurious-reward brittleness: have newer models (post-mid-2025 training), improved RL curricula, or finer-grained evaluation harnesses (step-level confidence filtering, mechanistic trace inspection) since relaxed or overturned these findings? Separate the durable question (what 'true reasoning' even means in closed-form benchmarks) from the perishable limitation (which artifacts can now be filtered). Cite what resolved it.
(2) Surface the strongest contradicting or superseding work from the last ~6 months (assume library cutoff ~Oct 2025). Has work on negative RL, cognitive tools, or multi-agent reasoning orchestration changed the regime?
(3) Propose 2 research questions that ASSUME the measurement and training regime may have moved, e.g., "Can step-level validity filters on in-context reasoning traces (not final answers) isolate genuine capability expansion?" or "Does curriculum-scheduled sample difficulty (avoiding both trivial and degenerate regimes), combined with mechanistic interpretation, recover reasoning gains that clean benchmarks miss?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
