Reinforcement Learning for LLMs

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Explores whether RLVR's apparent effectiveness with spurious rewards on contaminated benchmarks like MATH-500 represents actual reasoning gains or merely retrieval of memorized training data.

Note · 2026-02-23 · sourced from Flaws
What does reward learning actually do to model reasoning?

The apparent success of RLVR with random, incorrect, or spurious reward signals on Qwen models may be an artifact of data contamination rather than evidence of genuine reasoning improvement.

The contamination evidence: prompting Qwen2.5-Math-7B with the first 60% of each MATH-500 question yields 54.6% exact-match reconstruction of the remaining 40%, and the model answers 53.6% of these truncated problems correctly. On LiveMathBench, a benchmark released after Qwen2.5, the completion rate drops to 0.0%. Llama3.1-8B, by contrast, shows near-zero rates on both (3.8% on MATH-500, 0.0% on LiveMathBench). The conclusion: the model has memorized MATH-500.
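The partial-prompt probe above can be sketched in a few lines. This is a minimal illustration, not the paper's actual harness: `generate` is a hypothetical hook standing in for a real model's greedy decoding API, and the character-level 60/40 split is an assumption about how the truncation is done.

```python
def split_question(question: str, frac: float = 0.6) -> tuple[str, str]:
    """Split a benchmark question at `frac` of its length (by characters)."""
    cut = int(len(question) * frac)
    return question[:cut], question[cut:]

def contamination_rate(questions, generate) -> float:
    """Fraction of questions whose withheld 40% the model reproduces verbatim.

    `generate(prefix)` is a hypothetical hook returning the model's
    greedy continuation of `prefix`.
    """
    hits = 0
    for q in questions:
        prefix, tail = split_question(q)
        completion = generate(prefix)
        # Exact-match reconstruction: the continuation must begin with
        # the withheld remainder (whitespace-normalized at the edges).
        if completion.strip().startswith(tail.strip()):
            hits += 1
    return hits / len(questions)
```

Running this probe on a benchmark released after the model's training cutoff (as with LiveMathBench here) gives the uncontaminated baseline to compare against.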

On a fully clean benchmark (RandomCalculation — synthetic arithmetic expressions generated after Qwen's release): correct rewards deliver consistent gains surpassing the model's performance ceiling; random rewards make training highly unstable with no reliable improvement; inverse rewards rapidly erode mathematical reasoning ability.

This directly challenges the linked note "Why do random rewards improve reasoning for some models but not others?". The prior interpretation, that any optimization pressure activates pretraining strategies, may confound two effects: genuine strategy activation (possible) and recall of memorized answers triggered by format-similar optimization (likely for contaminated benchmarks). On clean data, the "any reward works" finding evaporates for random and inverse signals.

The practical implication: RLVR conclusions drawn from MATH-500 and similar benchmarks for Qwen models should be interpreted with caution. Reward engineering may matter more than the spurious-reward literature suggests, because much of that literature was likely measuring memorization recovery, not reasoning improvement.


Original note title

RLVR effectiveness on contaminated benchmarks is primarily data memorization — clean benchmarks eliminate spurious reward gains