How much of MATH-500 improvement comes from data contamination versus real reasoning gains?
This explores whether higher MATH-500 scores reflect models actually getting better at reasoning, or just having seen the test — and the corpus suggests the honest answer is 'partly both, and they're hard to tell apart.'
The corpus is unusually direct about the trap. The sharpest evidence: Qwen2.5-Math-7B can reconstruct 54.6% of MATH-500 from partial prompts alone, yet scores 0.0% on LiveMathBench, a benchmark released after the model was trained (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). That gap is the whole story in miniature: a benchmark the model may have ingested looks like reasoning; a clean one it could not have seen exposes how little transferred. On contaminated benchmarks, the gains are mostly recall.
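A partial-prompt probe of the kind described above can be sketched as follows. This is a minimal illustration, not the procedure behind the 54.6% figure: the 60% prefix split, the exact-match criterion, and the names `split_prompt`, `reconstruction_rate`, and `toy_complete` are all assumptions made here, and `toy_complete` is a stand-in for a real model call.

```python
from typing import Callable, Sequence, Tuple


def split_prompt(problem: str, prefix_frac: float = 0.6) -> Tuple[str, str]:
    """Split a problem statement into a visible prefix and a hidden remainder.

    The 0.6 default is an assumption for illustration, not a documented setting.
    """
    words = problem.split()
    cut = max(1, int(len(words) * prefix_frac))
    return " ".join(words[:cut]), " ".join(words[cut:])


def reconstruction_rate(
    problems: Sequence[str],
    complete: Callable[[str], str],
    prefix_frac: float = 0.6,
) -> float:
    """Fraction of problems whose hidden remainder the model reproduces verbatim."""
    hits = 0
    scored = 0
    for problem in problems:
        prefix, remainder = split_prompt(problem, prefix_frac)
        if not remainder:
            continue  # nothing was hidden, so nothing to score
        scored += 1
        # Normalize whitespace so only the token content is compared.
        generated = " ".join(complete(prefix).split())
        if generated.startswith(remainder):
            hits += 1
    return hits / scored if scored else 0.0


# Hypothetical toy data: SEEN plays the role of a benchmark the model ingested,
# UNSEEN the role of a benchmark released after training.
SEEN = [
    "What is the sum of the first ten positive even integers ?",
    "Find all real x such that x squared equals nine .",
]
UNSEEN = ["How many primes lie strictly between forty and sixty ?"]


def toy_complete(prefix: str) -> str:
    """Stand-in for a model: continues a prefix only if it memorized the text."""
    for text in SEEN:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ""


seen_rate = reconstruction_rate(SEEN, toy_complete)
unseen_rate = reconstruction_rate(UNSEEN, toy_complete)
```

A high rate on the older benchmark alongside a near-zero rate on the post-release one is the signature the paragraph describes. Exact match is a strict criterion; a softer overlap score would be a reasonable substitute.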
But 'mostly recall' isn't the same as 'nothing real.' One note argues the two effects are genuinely separable: RLVR can activate authentic reasoning behaviors while the headline benchmark number is simultaneously inflated by memorization. The two operate at different measurement levels and can coexist without contradiction (Can genuine reasoning activation coexist with contaminated benchmarks?). So the question 'how much is contamination vs. reasoning' carries a hidden assumption: that it's one pie split two ways. It may instead be two different things measured by one number.
What's striking is how thin 'real reasoning gains' turn out to be even when contamination isn't the issue. RLVR makes reasoning traces more locally coherent, with fewer logical errors between adjacent steps, without making the overall proof valid (Does RLVR actually improve mathematical reasoning or just coherence?). A single RLVR training example can lift math accuracy from 36% to 73.6%, which sounds like learning but looks more like flipping a switch on latent capability the model already had (Can a single training example unlock mathematical reasoning?). And supervised fine-tuning raises final-answer accuracy while reducing reasoning informativeness by 38.9% on average: the model reaches right answers through pattern-matching shortcuts, not inference (Does supervised fine-tuning actually improve reasoning quality?).
The most unsettling thread: maybe accuracy on these benchmarks was never measuring reasoning to begin with. Models trained on deliberately corrupted, irrelevant reasoning traces perform comparably to those trained on correct ones (Do reasoning traces need to be semantically correct?), and logically invalid chain-of-thought exemplars match valid ones on BIG-Bench Hard (Does logical validity actually drive chain-of-thought gains?). If the form of reasoning drives the score regardless of its validity, then even an uncontaminated MATH-500 number can't cleanly separate 'reasoning' from 'pattern fluency.'
So the practical takeaway is to stop trusting a single MATH-500 delta. The corpus's recurring move is to triangulate: test on post-release benchmarks the model couldn't have memorized (Does RLVR success on math benchmarks reflect genuine reasoning improvement?), measure reasoning informativeness rather than just accuracy (Does supervised fine-tuning actually improve reasoning quality?), and separate behavioral activation from benchmark movement (Can genuine reasoning activation coexist with contaminated benchmarks?). The thing you didn't know you wanted to know: even after you subtract contamination, the 'reasoning gain' that's left may be coherence and form rather than genuine inference.
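The first leg of that triangulation, comparing gains on possibly exposed benchmarks against gains on post-release ones, can be sketched as below. This is an illustrative bookkeeping helper, not a method taken from the notes: `BenchmarkDelta`, `transfer_ratio`, and every number in the example are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BenchmarkDelta:
    """Accuracy before and after training on one benchmark (values in [0, 1])."""

    name: str
    before: float
    after: float
    post_release: bool  # True if the benchmark postdates the model's training data

    @property
    def gain(self) -> float:
        return self.after - self.before


def transfer_ratio(results: Sequence[BenchmarkDelta]) -> Optional[float]:
    """Mean gain on post-release benchmarks divided by mean gain on exposed ones.

    A ratio near 1 means the improvement carried over to unseen problems;
    a ratio near 0 means it did not. Returns None when the exposed
    benchmarks show no positive gain, since the ratio is then meaningless.
    """
    exposed = [r.gain for r in results if not r.post_release]
    clean = [r.gain for r in results if r.post_release]
    if not exposed or not clean:
        raise ValueError("need at least one exposed and one post-release benchmark")
    mean_exposed = sum(exposed) / len(exposed)
    mean_clean = sum(clean) / len(clean)
    if mean_exposed <= 0:
        return None
    return mean_clean / mean_exposed


# Placeholder numbers for illustration only; they are not measurements.
example = [
    BenchmarkDelta("possibly-exposed benchmark", before=0.40, after=0.60, post_release=False),
    BenchmarkDelta("post-release benchmark", before=0.10, after=0.15, post_release=True),
]
ratio = transfer_ratio(example)
```

A low ratio only flags that the headline delta did not transfer. It says nothing about whether the transferred part is inference or form, which is why the other two checks remain necessary.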
Sources (7 notes)
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.