Can contamination-free evaluation distinguish between memorization and genuine prediction ability?

This explores whether testing models on clean, never-before-seen data can actually tell apart "the model memorized the answer" from "the model can genuinely reason its way to a new answer" — and what the corpus reveals about how hard that separation really is.

Contamination-free evaluation, meaning testing on data the model could not have seen during training, can expose the difference between memorization and genuine prediction dramatically. But the corpus also reveals that memorization and reasoning are not a clean binary in the first place, which complicates the whole project.

The sharpest evidence that clean evaluation matters comes from math benchmarks. One striking result: Qwen2.5-Math-7B could reconstruct 54.6% of MATH-500 from partial prompts alone, strong evidence that it had absorbed the test during training, yet it scored 0.0% on LiveMathBench, a benchmark released *after* the model (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). The post-release benchmark is the contamination-free probe, and it immediately separated a model that looked brilliant on paper from one that could actually do the work. The same study found that on clean benchmarks only correct rewards improved performance, while random and inverse rewards failed to help or degraded it, which is exactly what you'd expect if the earlier gains were recall, not reasoning.
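The partial-prompt probe described above can be sketched in a few lines. This is a minimal illustration, not the study's actual protocol: the 60% prefix split, the word-level similarity score from `difflib`, the `reconstruction_rate` helper, and the `toy_model` stand-in are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence


def reconstruction_rate(
    problems: Sequence[str],
    complete: Callable[[str], str],
    prefix_frac: float = 0.6,      # assumed split point, not taken from the source
    min_similarity: float = 1.0,   # 1.0 means the hidden tail must match exactly
) -> float:
    """Fraction of problems whose hidden tail the model reproduces from the prefix."""
    if not problems:
        raise ValueError("need at least one problem")
    hits = 0
    for text in problems:
        words = text.split()
        cut = max(1, int(len(words) * prefix_frac))
        prefix, tail = words[:cut], words[cut:]
        if not tail:
            continue  # nothing was hidden, so this problem counts as a miss
        guess = complete(" ".join(prefix)).split()[: len(tail)]
        # Word-level similarity between the model's continuation and the true tail.
        if SequenceMatcher(None, guess, tail).ratio() >= min_similarity:
            hits += 1
    return hits / len(problems)


# Hypothetical stand-in for a model: it "memorized" one benchmark item verbatim.
SEEN = ["Find the sum of all integers from one to ten inclusive"]
UNSEEN = ["Compute the area of a triangle with sides three four five"]


def toy_model(prefix: str) -> str:
    for text in SEEN:
        if text.startswith(prefix):
            return text[len(prefix):]
    return "I do not know"


seen_rate = reconstruction_rate(SEEN, toy_model)
unseen_rate = reconstruction_rate(UNSEEN, toy_model)
```

A high rate on an old benchmark paired with a near-zero rate on a post-release one is the pattern the paragraph describes. In practice `complete` would wrap a real model call, and a softer `min_similarity` would catch near-verbatim recall.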

But here's the thing the reader might not expect: memorization and genuine prediction aren't two separate buckets a model falls into. They happen *simultaneously, inside the same answer.* A study that decomposed chain-of-thought reasoning into three independent factors found that the probability of the output alone could swing accuracy from 26% to 70%, that memorization tracked how often patterns appeared in pre-training, and that real step-by-step reasoning existed too but accumulated error at every step (see "What three separate factors drive chain-of-thought performance?"). So a single "correct" answer can be part recall, part favorable token statistics, and part actual inference. A clean benchmark removes the recall shortcut, but it doesn't tell you which of the remaining factors carried the load.
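To make "accumulated error at every step" concrete, here is a back-of-the-envelope sketch. It assumes each reasoning step succeeds independently with the same probability, which is a simplification introduced here and not something the study claims; `chain_accuracy` and the 0.95 figure are illustrative.

```python
def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Chance that all `steps` steps are correct, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# Illustrative only: a 95%-reliable step, chained over longer and longer derivations.
for steps in (1, 5, 10, 20):
    print(steps, round(chain_accuracy(0.95, steps), 3))
```

Under this toy assumption a step that is right 95% of the time yields a fully correct 10-step chain only about 60% of the time, which is why long reasoning chains can fail even when no memorized shortcut is involved.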

That's why some of the most interesting work goes *inside* the model rather than just swapping the test set. Memorized passages leave a distinctive fingerprint: larger gradients in lower layers and a specific low-layer attention head that attends to rare tokens (see "Where does a model store memorized paragraphs?"). Reasoning errors, in turn, trace back to identifiable memorization sources, with "local" memorization based on the immediately preceding tokens accounting for up to 67% of them (see "Where do memorization errors arise in chain-of-thought reasoning?"). These approaches diagnose memorization mechanistically, sidestepping the question of whether your test data is truly uncontaminated. Relatedly, whether newly learned text primes a keyword is surprisingly predictable from that keyword's probability before learning even happens (see "Can we predict keyword priming before learning happens?"), which suggests contamination effects could in principle be anticipated, not just caught after the fact.

The deeper warning the corpus offers: evaluation can be fooled at the *surface* in ways clean data alone won't fix. Models trained to imitate ChatGPT learned its confident, fluent style and fooled human judges while closing no real capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). And even a deterministic, zero-temperature setup that produces the same answer every time gives you consistency, which is not the same thing as reliability (see "Does setting temperature to zero actually make LLM outputs reliable?"). So contamination-free evaluation is necessary, because it strips away the most blatant form of cheating, but it's not sufficient on its own. The honest answer is that clean test sets catch memorization-as-leakage, mechanistic probes catch memorization-as-mechanism, and you likely need both to be confident you're measuring genuine prediction rather than a convincing echo.
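The consistency-versus-reliability point can be illustrated with a toy sampler. This sketch models only the fixed-seed half of the claim, using Python's standard `random` module. The answer set, the weights, and `sample_answer` are hypothetical, and it does not reproduce the McDonald's omega analysis cited in the sources.

```python
import random

# Hypothetical answer distribution for one prompt; the weights are made up.
ANSWERS = ["A", "B", "C"]
WEIGHTS = [0.5, 0.3, 0.2]


def sample_answer(seed: int) -> str:
    """Draw one answer from the fixed distribution using a seeded generator."""
    rng = random.Random(seed)
    return rng.choices(ANSWERS, weights=WEIGHTS, k=1)[0]


# Same seed, 100 repetitions: perfectly consistent output.
same_seed_runs = [sample_answer(seed=0) for _ in range(100)]

# Different seeds: the spread that the fixed seed was hiding.
varied_seed_runs = [sample_answer(seed=s) for s in range(100)]

print(len(set(same_seed_runs)), "distinct answer(s) with a fixed seed")
print(len(set(varied_seed_runs)), "distinct answer(s) across seeds")
```

Repeating the fixed-seed run a hundred times says nothing about how much probability the model actually places on that answer. Only varying the draw reveals whether the repeated output is typical or a fluke.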

Sources 7 notes

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Where does a model store memorized paragraphs?

Memorized paragraphs leave a distinctive fingerprint in GPT-Neo: larger gradients in lower layers, concentration in a specific low-layer attention head attending to rare tokens, and dependence on a few early-prefix tokens. This localization makes memorization targetable for unlearning.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, based on the immediately preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
