Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Paper · arXiv 2507.10532 · Published July 14, 2025
Tags: Flaws · Memory · Reasoning Critiques · LLM Architecture

To obtain trustworthy evaluation signals, we introduce a generator that creates fully synthetic arithmetic problems of arbitrary length and difficulty, yielding clean datasets we call RandomCalculation. Using these leakage-free datasets, we further show that under the RL protocol, only accurate reward signals yield steady improvements that surpass the model’s performance ceiling in mathematical reasoning, whereas noisy or incorrect rewards do not.
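To make the construction concrete, here is a minimal Python sketch of a RandomCalculation-style generator. The operand range, operator set, and expression formatting are illustrative assumptions, not necessarily the paper's exact settings.

```python
import operator
import random

# Illustrative operator set; the paper's generator may differ (e.g., include division).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def random_calculation(num_steps, lo=1, hi=100, seed=None):
    """Build one synthetic arithmetic problem with `num_steps` operations over
    uniformly random operands and operators, and return it with its exact answer."""
    rng = random.Random(seed)
    expr = str(rng.randint(lo, hi))
    for _ in range(num_steps):
        op = rng.choice(list(OPS))
        expr = f"({expr} {op} {rng.randint(lo, hi)})"
    # The expression is fully parenthesized, so evaluation order is unambiguous.
    answer = eval(expr)  # safe here: expr contains only digits, spaces, + - * and parentheses
    return expr, answer

# Example: a fresh 5-step problem with a verifiable ground-truth answer.
problem, answer = random_calculation(num_steps=5, seed=0)
print(f"{problem} = {answer}")
```

Because every problem is sampled at evaluation time, difficulty can be scaled by `num_steps` and no instance can have appeared in a pre-training corpus.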

To assess the extent of potential data contamination in popular mathematical benchmarks, we evaluate two indicators: partial-prompt completion rate (can the model reconstruct the tail of a problem?) and partial-prompt answer accuracy (can the model give the correct answer to an incomplete problem?). As Figure 1 illustrates, Qwen can accurately complete the problem statement and provide the correct answer, whereas Llama can do neither. Specifically, prompting Qwen2.5-Math-7B with the first 60% of each MATH-500 question, we find that it regenerates the remaining 40% with a 54.6% exact-match rate and still answers 53.6% of these incomplete problems correctly. Llama3.1-8B, in contrast, scores just 3.8% and 2.4% on the same two metrics. Crucially, on the newly released LiveMathBench (version 202505) (Liu et al., 2024), a benchmark compiled after the release of the Qwen2.5 model family, Qwen’s completion rate drops sharply to 0.0%, matching Llama’s 0.0%. Its partial-prompt answer accuracy also falls to just 2.0%, comparable to Llama’s 1.0%. These results suggest that the earlier gains on MATH-500 may stem from memorized content rather than genuine reasoning. Hence, results derived from MATH-500 and similar datasets for Qwen models should be interpreted with caution.
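Both indicators reduce to a short evaluation loop, sketched below. The `generate` and `solve` callables are hypothetical stand-ins for model calls (e.g., wrappers around `model.generate`), and the exact prompting and answer-extraction details in the paper may differ.

```python
from typing import Callable, List, Tuple

def contamination_indicators(
    problems: List[str],
    answers: List[str],
    generate: Callable[[str], str],  # hypothetical: returns the model's continuation of a prefix
    solve: Callable[[str], str],     # hypothetical: returns the model's final answer to a prompt
    prefix_ratio: float = 0.6,
) -> Tuple[float, float]:
    """Return (partial-prompt completion rate, partial-prompt answer accuracy)."""
    completed = answered = 0
    for problem, gold in zip(problems, answers):
        cut = int(len(problem) * prefix_ratio)
        prefix, tail = problem[:cut], problem[cut:]
        # Indicator 1: does the model reproduce the hidden 40% verbatim?
        # (An exact-match criterion on the tail; the paper's matching rule may differ.)
        if generate(prefix).strip().startswith(tail.strip()):
            completed += 1
        # Indicator 2: does the model still answer correctly from the truncated problem?
        if solve(prefix).strip() == gold.strip():
            answered += 1
    n = len(problems)
    return completed / n, answered / n
```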

Building on this evidence, we hypothesize that data contamination is the chief factor behind the seemingly “magical” success of random-reward or few-shot RLVR variants on Qwen models. To test this claim, we create a fully fresh benchmark (representative examples shown in Figure 2): an automatic generator constructs arithmetic expressions of arbitrary length with uniformly random operands and operators, guaranteeing that every instance post-dates the public release of Qwen2.5. A zero-shot evaluation on this benchmark shows no sign of memorization: the accuracy of Qwen2.5 models declines monotonically with the number of computation steps, leaving ample room for improvement. To isolate the effect of reward quality, we next train Qwen2.5-Math-7B under the standard RLVR protocol on two leakage-free subsets. The outcome is unambiguous: correct rewards deliver consistent performance gains that surpass the model’s performance ceiling; random rewards make training highly unstable and yield no reliable improvement; and inverse rewards rapidly erode the model’s mathematical-reasoning ability. These results rule out the “Strong Baseline Math Skills” explanation and directly implicate “Data Contamination”: once leakage is removed, the prior gains evaporate.
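The three reward conditions in this comparison reduce to simple verifier functions. A minimal sketch follows, assuming exact string match against the generator's ground-truth answer; the paper's verifier may normalize answers differently.

```python
import random

def correct_reward(pred: str, gold: str) -> float:
    """Verifiable reward: 1.0 iff the predicted final answer matches the ground truth."""
    return 1.0 if pred.strip() == gold.strip() else 0.0

def random_reward(pred: str, gold: str) -> float:
    """Noisy control: a coin flip, independent of correctness."""
    return float(random.random() < 0.5)

def inverse_reward(pred: str, gold: str) -> float:
    """Adversarial control: rewards exactly the wrong answers."""
    return 1.0 - correct_reward(pred, gold)
```

On a leakage-free benchmark, only `correct_reward` carries a learnable signal, which is consistent with the training dynamics described above.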