Reinforcement Learning for LLMs · LLM Reasoning and Architecture · Language Understanding and Pragmatics

How much of LLM few-shot ability comes from training data?

Do large language models genuinely learn from a few examples, or are they mostly recognizing patterns from their training data? This matters for understanding what LLMs can actually do.

Note · 2026-03-30 · sourced from Tasks Planning

Task contamination is distinct from test data contamination. Test data contamination means specific test examples leaked into training data. Task contamination means training examples for the evaluated task were included in pretraining data — "effectively making the evaluation no longer zero or few-shot."

The evidence is systematic: "on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after." Training data inspection, task example extraction, and membership inference attacks all confirm the contamination. The temporal pattern is the strongest signal — the same models perform differently on the same task depending on whether the evaluation dataset predates their training cutoff.
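The temporal check can be sketched in a few lines. This is a minimal illustration, not the paper's analysis pipeline: the dataset names, release dates, accuracies, and training cutoff below are all hypothetical placeholders.

```python
from datetime import date
from statistics import mean

# Hypothetical evaluation records: (dataset name, release date, model accuracy).
# All values are illustrative, not taken from the paper.
results = [
    ("old_clf_a", date(2013, 10, 1), 0.91),
    ("old_clf_b", date(2019, 5, 1),  0.84),
    ("new_clf_a", date(2023, 6, 1),  0.55),
    ("new_clf_b", date(2023, 9, 1),  0.52),
]

TRAINING_CUTOFF = date(2022, 9, 1)  # assumed model training-data cutoff

# Partition evaluations by whether the dataset predates the cutoff.
pre  = [acc for _, d, acc in results if d < TRAINING_CUTOFF]
post = [acc for _, d, acc in results if d >= TRAINING_CUTOFF]

print(f"pre-cutoff mean accuracy:  {mean(pre):.3f}")   # → 0.875
print(f"post-cutoff mean accuracy: {mean(post):.3f}")  # → 0.535
# A large pre/post gap on comparable tasks is the contamination signal.
```

The same model and the same task type appear on both sides of the split; only the dataset's release date changes, which is what makes the gap attributable to contamination rather than task difficulty.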

The devastating finding: "for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings." This challenges the entire few-shot learning narrative. If few-shot capabilities largely reflect having seen task-specific training examples during pretraining, then the celebrated generalization of LLMs may be narrower than assumed.
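The majority-baseline comparison amounts to an exact one-sided binomial test: is the model's accuracy significantly above the majority-class rate? A minimal sketch, with made-up numbers (a 60% majority class and 66/100 model accuracy are assumptions for illustration):

```python
from math import comb

def binom_sf(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p): one-sided survival function."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical uncontaminated binary task: the majority class covers 60% of
# n = 100 test items, so always guessing it scores 0.60.
n, majority_rate = 100, 0.60
model_correct = 66  # model answered 66/100 correctly

# One-sided test: is 66/100 significantly better than the majority baseline?
p_value = binom_sf(model_correct, n, majority_rate)
print(f"p = {p_value:.3f}")  # above 0.05: not a significant improvement
```

At these sample sizes, an accuracy a few points above the majority rate does not reach significance, which is the shape of the finding: apparent few-shot wins on uncontaminated tasks are often within noise of trivial baselines.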

This is distinct from the RLVR contamination problem. As Does RLVR success on math benchmarks reflect genuine reasoning improvement? argues, that finding is specific to reward-based training: models memorize benchmark answers and RL rewards amplify the memorization signal. Task contamination is broader — it affects the foundational claim that LLMs can generalize from a few examples to new tasks. The RLVR contamination shows rewards don't help beyond memorization; task contamination shows the "few-shot" framing may itself be an illusion.

The practical implication is methodological: any evaluation of LLM capabilities on tasks that existed before the model's training data cutoff must be treated with suspicion. Taken together with Do popular prompting techniques actually improve model performance?, contamination adds a second mechanism for irreproducibility: not only do prompting techniques fail to replicate, but baseline measurements may be inflated by contamination. The replication crisis compounds.

The manhole cover analogy is apt: "Think of that infamous 'Why are manhole covers round?' interview question. While it may well have given the interviewer an insight into the candidate's analytical reasoning skills the very first time it was asked, all it does with high probability now is to confirm whether the candidate trained on the interview question banks."


Source: Tasks Planning · Paper: Task Contamination: Language Models May Not Be Few-Shot Anymore


task contamination makes zero-shot and few-shot LLM evaluation unreliable — on uncontaminated classification tasks LLMs rarely beat majority baselines