How much of LLM few-shot ability comes from training data?
Do large language models genuinely learn from a few examples, or are they mostly recognizing patterns from their training data? This matters for understanding what LLMs can actually do.
Task contamination is distinct from test data contamination. Test data contamination means specific test examples leaked into training data. Task contamination means training examples for the evaluated task were included in pretraining data — "effectively making the evaluation no longer zero or few-shot."
The evidence is systematic: "on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after." Training data inspection, task example extraction, and membership inference attacks all confirm the contamination. The temporal pattern is the strongest signal — the same models perform differently on the same task depending on whether the evaluation dataset predates their training cutoff.
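The temporal comparison is straightforward to reproduce in outline. Below is a minimal sketch of that chronological analysis, assuming you already have per-dataset accuracies and the model's training-data creation date; the dataset names, years, cutoff, and accuracy figures are illustrative placeholders, not the paper's data.

```python
from statistics import mean

# Hypothetical evaluation records: (dataset, release_year, accuracy).
# Values are illustrative placeholders, not the paper's numbers.
results = [
    ("dataset_a", 2018, 0.81),
    ("dataset_b", 2019, 0.78),
    ("dataset_c", 2022, 0.55),
    ("dataset_d", 2023, 0.52),
]

TRAINING_CUTOFF = 2021  # assumed training-data creation date for the model

# Split datasets by whether they predate the training cutoff.
before = [acc for _, year, acc in results if year < TRAINING_CUTOFF]
after = [acc for _, year, acc in results if year >= TRAINING_CUTOFF]

# The contamination signal: systematically higher accuracy on
# pre-cutoff datasets than on post-cutoff ones for the same task type.
print(f"mean accuracy, pre-cutoff datasets:  {mean(before):.2f}")
print(f"mean accuracy, post-cutoff datasets: {mean(after):.2f}")
```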
The devastating finding: "for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings." This challenges the entire few-shot learning narrative. If few-shot capabilities largely reflect having seen task-specific training examples during pretraining, then the celebrated generalization of LLMs may be narrower than assumed.
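The majority-baseline check is easy to run on any classification evaluation. A minimal sketch follows, using a one-sided binomial test against the majority-class accuracy; the paper's exact statistical procedure may differ, and the labels and predictions here are invented.

```python
from collections import Counter
from scipy.stats import binomtest

# Hypothetical gold labels and model predictions for a classification
# task released after the training cutoff (placeholder data).
gold = ["pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos", "neg", "neg"]
pred = ["pos", "neg", "pos", "neg", "neg", "pos", "neg", "neg", "neg", "neg"]

# Majority-class baseline: always predict the most frequent gold label.
majority_label, majority_count = Counter(gold).most_common(1)[0]
baseline_acc = majority_count / len(gold)

correct = sum(g == p for g, p in zip(gold, pred))

# One-sided binomial test: is the model's accuracy significantly above
# what always guessing the majority class would achieve?
test = binomtest(correct, n=len(gold), p=baseline_acc, alternative="greater")
print(f"model accuracy:    {correct / len(gold):.2f}")
print(f"majority baseline: {baseline_acc:.2f}")
print(f"p-value:           {test.pvalue:.3f}")
```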
This is distinct from the RLVR contamination problem. As "Does RLVR success on math benchmarks reflect genuine reasoning improvement?" argues, that finding is specific to reward-based training: models memorize benchmark answers, and RL rewards amplify the memorization signal. Task contamination is broader: it undermines the foundational claim that LLMs can generalize from a few examples to new tasks. The RLVR result shows rewards don't help beyond memorization; task contamination shows the "few-shot" framing may itself be an illusion.
The practical implication is methodological: any evaluation of LLM capabilities on tasks that existed before the model's training data cutoff must be treated with suspicion. Combined with "Do popular prompting techniques actually improve model performance?", contamination adds a second mechanism for irreproducibility: not only do prompting techniques fail to replicate, but baseline measurements may be inflated by contamination. The replication crisis compounds.
The manhole cover analogy is apt: "Think of that infamous 'Why are manhole covers round?' interview question. While it may well have given the interviewer an insight into the candidate's analytical reasoning skills the very first time it was asked, all it does with high probability now is to confirm whether the candidate trained on the interview question banks."
Source: Task Contamination: Language Models May Not Be Few-Shot Anymore
Related concepts in this collection
- Does RLVR success on math benchmarks reflect genuine reasoning improvement? Explores whether RLVR's apparent effectiveness with spurious rewards on contaminated benchmarks like MATH-500 represents actual reasoning gains or merely retrieval of memorized data. Relation: RLVR-specific contamination; this note extends the concern to the broader few-shot evaluation paradigm.
- Do popular prompting techniques actually improve model performance? Five widely-cited prompting methods (chain-of-thought, emotion prompting, sandbagging, and others) are tested across multiple models and benchmarks to see whether their reported improvements hold up under rigorous statistical analysis. Relation: contamination adds a second irreproducibility mechanism.
- Are LLM emergent abilities real or measurement artifacts? Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior. Relation: a third challenge to the capabilities narrative: emergent abilities are metric artifacts, task contamination inflates baselines, and prompting gains don't replicate.
Original note title: task contamination makes zero-shot and few-shot LLM evaluation unreliable — on uncontaminated classification tasks LLMs rarely beat majority baselines