INQUIRING LINE

How does task contamination differ from test set data leakage?

This explores two failure modes that both get loosely called 'contamination' but live at opposite ends of the model lifecycle — static benchmark leakage (eval answers leaking into training data) versus runtime task contamination (a model's own working context filling up with errors mid-job).


This question separates two things the word 'contamination' tends to blur. Test set data leakage is a *static, pre-runtime* problem: benchmark questions or answers end up in the training corpus, so a high score reflects memorization rather than capability. Task contamination is a *dynamic, runtime* problem: as a model works through a long job, its own prior outputs — including mistakes — pollute its context and bias everything that follows. One corrupts what the model *learned*; the other corrupts what the model is *currently doing*.

The cleanest illustration of the leakage side comes from work showing that benchmark improvement and genuine capability can be entirely separate phenomena (see 'Can genuine reasoning activation coexist with contaminated benchmarks?'). RLVR (reinforcement learning with verifiable rewards) training can activate real reasoning patterns *and* a benchmark number can climb purely because of memorization on a contaminated dataset; the two operate at different measurement levels and can coexist without contradiction. That's the unsettling part: a leaked test set doesn't announce itself by breaking the model; it quietly inflates the scoreboard while the underlying skill is unchanged. A related caution shows up in how instruction tuning works: models trained on semantically empty or deliberately wrong instructions score about as well as those given correct ones, meaning the benchmark may be measuring familiarity with the output format rather than task understanding (see 'Does instruction tuning teach task understanding or output format?'). Both cases point to the same lesson: a good score can be an artifact of what the model has already seen, not what it can do.
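A common first check for this kind of leakage is to look for long word n-grams that a benchmark item shares with the training corpus. The sketch below is a minimal illustration under stated assumptions: the function names, the 8-token default window, and the whitespace-and-punctuation tokenizer are hypothetical choices made for the example, not a method taken from the notes cited here.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_docs, n=8):
    """Fraction of the item's n-grams that also appear somewhere in the corpus.

    A ratio near 1.0 suggests the item appears (near-)verbatim in the corpus;
    a ratio of 0.0 only means no n-gram of this length matched.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical toy data: one scraped page that happens to quote a benchmark question.
corpus = ["Quiz night recap. What is the capital of France? Paris. See you next week."]
leaked = overlap_ratio("What is the capital of France?", corpus, n=3)
clean = overlap_ratio("How many legs does a spider have?", corpus, n=3)
print(leaked, clean)
```

A low ratio is not proof of a clean benchmark: paraphrased or translated copies of a question would slip past an exact n-gram match, which is part of why leakage can go unnoticed.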

Task contamination is a fundamentally different beast because it emerges *during* execution and compounds. When a model's earlier errors sit in its context window, performance degrades non-linearly, and scaling the model doesn't fix it; only test-time 'thinking' compute helps, by preventing the error-laced context from biasing fresh reasoning (see 'Do models fail worse when their own errors fill the context?'). You can watch this play out in long delegated workflows, where frontier models silently corrupt roughly 25% of document content across extended relay tasks, with errors accumulating round after round and showing no plateau through 50 round-trips (see 'Do frontier LLMs silently corrupt documents in long workflows?'). Nobody injected bad data; the model contaminated itself.
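To make the compounding concrete, here is a toy model, not the setup used in the cited notes. It assumes each error already in context multiplies the chance of the next error; `base_error` and `feedback` are hypothetical parameters chosen only to show the shape of the curve. With `feedback=0` errors are independent and per-step accuracy stays flat; with `feedback>0` accuracy falls faster the longer the job runs.

```python
def per_step_accuracy(steps, base_error=0.02, feedback=0.0):
    """Expected accuracy at each step of a long task under a toy error model.

    Assumption (illustrative only): with k earlier errors in context, the
    chance of a new error is base_error * (1 + feedback * k), capped at 1.
    """
    dist = [1.0]  # dist[k] = probability that k errors are in context so far
    accuracy = []
    for _ in range(steps):
        new_dist = [0.0] * (len(dist) + 1)
        expected_ok = 0.0
        for k, p in enumerate(dist):
            err = min(1.0, base_error * (1.0 + feedback * k))
            expected_ok += p * (1.0 - err)
            new_dist[k] += p * (1.0 - err)   # step succeeds, error count unchanged
            new_dist[k + 1] += p * err       # step fails, one more error in context
        accuracy.append(expected_ok)
        dist = new_dist
    return accuracy


independent = per_step_accuracy(50, feedback=0.0)       # errors do not interact
self_conditioned = per_step_accuracy(50, feedback=2.0)  # errors feed on each other
print(round(independent[-1], 3), round(self_conditioned[-1], 3))
```

Both runs start at the same first-step accuracy; only the self-conditioned run drifts downward, which is the qualitative pattern the paragraph above describes.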

So the sharp distinction is this: test set leakage is a *measurement* failure (your evaluation lies to you about capability), while task contamination is an *execution* failure (the system degrades itself in real time). Leakage is fixed upstream by curating training data and decontaminating benchmarks; task contamination is fixed downstream by managing context, filtering low-confidence steps before they propagate (see 'Does step-level confidence outperform global averaging for trace filtering?'), or spending inference-time compute to avoid conditioning on prior mistakes.
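The step-filtering idea can be sketched as follows. This is a minimal illustration that assumes each reasoning step already comes with a confidence score in [0, 1]; the window size, the 0.5 threshold, and the function names are hypothetical, and the code is not the cited note's implementation. It shows why a local window can flag a breakdown that a whole-trace average hides.

```python
def global_confidence(step_conf):
    """Mean confidence over the whole trace (assumes a non-empty list)."""
    return sum(step_conf) / len(step_conf)


def lowest_window_confidence(step_conf, window=3):
    """Lowest mean confidence over any run of `window` consecutive steps."""
    w = min(window, len(step_conf))
    return min(
        sum(step_conf[i:i + w]) / w for i in range(len(step_conf) - w + 1)
    )


def first_breakdown(step_conf, window=3, threshold=0.5):
    """Index of the step where the sliding-window mean first drops below
    `threshold`, or None. Stopping here avoids generating the rest of the trace."""
    w = min(window, len(step_conf))
    for end in range(w, len(step_conf) + 1):
        if sum(step_conf[end - w:end]) / w < threshold:
            return end - 1
    return None


# Hypothetical trace: confident start, a three-step wobble, confident finish.
trace = [0.9, 0.95, 0.9, 0.2, 0.3, 0.25, 0.9, 0.95, 0.9, 0.95]
print(global_confidence(trace))         # whole-trace average looks acceptable
print(lowest_window_confidence(trace))  # the local dip is clearly visible
print(first_breakdown(trace))           # step index at which to stop or discard
```

In this example the global mean stays above the threshold while the worst window falls well below it, so a filter based on the average alone would let the shaky steps propagate into later context.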

The thing worth taking away: both are 'contamination' only by analogy, and conflating them leads to the wrong fix. If your worry is whether a benchmark is trustworthy, you're chasing leakage and the answer is in data provenance. If your worry is why a model that aced the benchmark falls apart on a 50-step real task, you're chasing task contamination — and no amount of clean training data will save you, because the corruption is being generated live by the model itself.


Sources (5 notes)

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Next inquiring lines