Do few-shot examples improve in-context learning or add noise?

This explores whether the examples you put in a prompt actually teach the model anything — or whether some of them just confuse it — and what the corpus says about making them help instead of hurt.

The corpus answer is that it depends on *which* examples you use and *how* they are arranged, not on how many you cram in. The most striking thread is that examples are not all worth the same. Optimal experimental design beats simple similarity-based retrieval by picking the demonstrations that most reduce uncertainty about the test set, which implies that naively grabbing the most 'similar' examples can leave performance on the table (see: Can optimal experimental design improve few-shot example selection?). Ordering matters too: arranging demonstrations from harder to easier, with difficulty measured by how sparsely the model represents them internally, yields considerable gains with no extra difficulty labels (see: Can representation sparsity order few-shot demonstrations effectively?). So the same set of examples can help or hurt depending on selection and sequence.
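To make the selection and ordering ideas concrete, here is a minimal sketch in plain Python. It is not the GO or SAL algorithm from the source note: the coverage-style uncertainty proxy, the unit-vector embeddings, and every function name below are assumptions chosen for illustration. The second half orders demonstrations from sparse to dense, using made-up activation vectors in place of a real model's last-layer activations.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def total_uncertainty(selected: List[Vector], test_points: List[Vector]) -> float:
    """Toy proxy (an assumption, not the paper's objective): a test point
    stays uncertain to the degree that no selected demonstration resembles it."""
    total = 0.0
    for t in test_points:
        best = max((dot(t, d) for d in selected), default=0.0)
        total += 1.0 - min(1.0, max(0.0, best))
    return total


def greedy_select(candidates: List[Vector], test_points: List[Vector], budget: int) -> List[int]:
    """Greedily add the candidate that most reduces total test-set uncertainty."""
    chosen: List[int] = []
    remaining = list(range(len(candidates)))
    for _ in range(min(budget, len(candidates))):
        def score(i: int) -> float:
            picked = [candidates[j] for j in chosen] + [candidates[i]]
            return total_uncertainty(picked, test_points)
        best = min(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen


def top_k_similar(candidates: List[Vector], test_points: List[Vector], k: int) -> List[int]:
    """Baseline: rank candidates by average similarity to the test set."""
    avg = [sum(dot(c, t) for t in test_points) / len(test_points) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: -avg[i])[:k]


def sparsity(activations: Vector, eps: float = 1e-6) -> float:
    """Fraction of (near-)zero entries in an activation vector."""
    return sum(1 for a in activations if abs(a) <= eps) / len(activations)


def order_sparse_to_dense(demos: List[str], activations: List[Vector]) -> List[str]:
    """Order demonstrations from sparse (treated as harder) to dense (easier)."""
    ranked = sorted(zip(demos, activations), key=lambda pair: -sparsity(pair[1]))
    return [demo for demo, _ in ranked]


# Hypothetical unit-vector embeddings: three test points cluster together, one is an outlier.
candidates = [(1.0, 0.0), (0.96, 0.28), (0.0, 1.0)]
test_points = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

by_similarity = top_k_similar(candidates, test_points, 2)  # [1, 0]: two near-duplicates
by_design = greedy_select(candidates, test_points, 2)      # [1, 2]: also covers the outlier
```

In this toy setup the similarity baseline spends its budget on two redundant demonstrations, while the greedy selector uses its second pick on the one test point the first pick left uncovered. A real system would replace the proxy with a model-based uncertainty estimate.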

The more counterintuitive finding is that *wrong* examples can teach more than right ones. The LEAP method deliberately induces the model to err on its few-shot examples, then has it reflect on the mistakes and write down explicit task principles, and this beats using clean correct demonstrations (see: Does learning from mistakes improve in-context learning?). That reframes the whole question: an example's value isn't 'is it a correct answer' but 'does it surface a principle the model can articulate.' A related lesson from argument-quality work is that labeled examples alone often teach surface patterns rather than the underlying criteria: models need explicit frameworks, not just more instances, to generalize (see: Can models learn argument quality from labeled examples alone?).
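The err, reflect, and articulate loop described above can be sketched as a small pipeline. This is a hedged illustration rather than the LEAP authors' implementation: the `llm` callable, the prompt wording, and the exact-match check for mistakes are all assumptions, and `stub_llm` is a deterministic stand-in so the example runs without a model.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (question, gold answer)
Mistake = Tuple[str, str, str]   # (question, gold answer, wrong attempt)


def collect_mistakes(llm: Callable[[str], str], examples: List[Example]) -> List[Mistake]:
    """Let the model attempt each few-shot example and keep the ones it gets wrong."""
    mistakes = []
    for question, gold in examples:
        attempt = llm(f"Question: {question}\nAnswer:")
        if attempt.strip() != gold.strip():  # exact match is a simplifying assumption
            mistakes.append((question, gold, attempt))
    return mistakes


def derive_principles(llm: Callable[[str], str], mistakes: List[Mistake]) -> List[str]:
    """Ask the model to reflect on each mistake and state a reusable principle."""
    principles = []
    for question, gold, attempt in mistakes:
        reflection_prompt = (
            f"Question: {question}\n"
            f"Wrong answer: {attempt}\n"
            f"Correct answer: {gold}\n"
            "State one general principle that avoids this mistake:"
        )
        principles.append(llm(reflection_prompt).strip())
    return principles


def build_prompt(principles: List[str], examples: List[Example], test_question: str) -> str:
    """Prepend the learned principles to the original few-shot examples."""
    lines = ["Principles:"]
    lines += [f"- {p}" for p in principles]
    for question, gold in examples:
        lines.append(f"Question: {question}\nAnswer: {gold}")
    lines.append(f"Question: {test_question}\nAnswer:")
    return "\n".join(lines)


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model (hypothetical behavior)."""
    if "principle" in prompt:
        return "Recompute the arithmetic step by step."
    return "42"


examples = [("What is 6 * 7?", "42"), ("What is 2 + 2?", "4")]
mistakes = collect_mistakes(stub_llm, examples)
principles = derive_principles(stub_llm, mistakes)
final_prompt = build_prompt(principles, examples, "What is 3 + 5?")
```

The design point is that the final prompt carries an explicit principle in addition to the demonstrations, so the mistakes contribute something the clean examples could not.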

For some tasks, individual examples are the wrong unit entirely. In sequential decision-making, isolated demonstrations don't enable in-context learning at all: the model needs full or partial *trajectories* from the same environment, a property called trajectory burstiness (see: Why do trajectories matter more than individual examples for in-context learning?). Here, scattering disconnected examples genuinely is noise; only coherent runs carry the signal. And at the extreme, the question of 'how many examples' can collapse. On the training side, a single example in RLVR can be enough to activate latent reasoning ability that was already in the model (see: Can a single training example unlock mathematical reasoning?), echoing the finding that just three training exposures suffice to establish a priming effect (see: Can we predict keyword priming before learning happens?).

The deepest reason examples sometimes add nothing is a hard ceiling: prompting can only reorganize and activate knowledge the model already acquired in training, never inject what's missing (see: Can prompt optimization teach models knowledge they lack?). Examples can also be actively overridden. When a model's prior training associations are strong, it will ignore the context in front of it, so demonstrations that contradict deeply learned priors get drowned out rather than learned (see: Why do language models ignore information in their context?). One way researchers fight this fragility is consistency training, which teaches models to respond the same way regardless of irrelevant wrapping or perturbation in the prompt, directly attacking the 'noise' side of the question (see: Can models learn to ignore irrelevant prompt changes?).
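As a rough illustration of the output-level flavor of consistency training, the sketch below builds only the training pairs: each wrapped prompt is paired with the model's own response to the clean prompt. The wrapper templates, the `model` callable, and the dictionary layout are assumptions for illustration. No fine-tuning step is shown, and the activation-level variant is not covered.

```python
from typing import Callable, Dict, List


def wrap_prompt(prompt: str, wrappers: List[str]) -> List[str]:
    """Apply each (hypothetical) wrapper template to a clean prompt."""
    return [w.format(prompt=prompt) for w in wrappers]


def build_consistency_pairs(
    model: Callable[[str], str],
    clean_prompts: List[str],
    wrappers: List[str],
) -> List[Dict[str, str]]:
    """Pair every wrapped prompt with the model's own clean-prompt response."""
    pairs = []
    for prompt in clean_prompts:
        target = model(prompt)  # the model's own clean response becomes the target
        for wrapped in wrap_prompt(prompt, wrappers):
            pairs.append({"prompt": wrapped, "target": target})
    return pairs


def stub_model(prompt: str) -> str:
    """Deterministic stand-in for a real model (hypothetical behavior)."""
    return f"answer to: {prompt}"


# Irrelevant wrapping that should not change the answer (made-up templates).
wrappers = [
    "I think the answer is obvious, but: {prompt}",
    "URGENT!!! {prompt} Please hurry.",
]
clean_prompts = ["What is the capital of France?", "Is 17 prime?"]
pairs = build_consistency_pairs(stub_model, clean_prompts, wrappers)
```

Because the targets come from the model itself rather than from a fixed labeled dataset, the pairs stay in step with the model's current behavior, which is the property the source note credits with avoiding the staleness of standard SFT.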

The takeaway you didn't know you wanted: few-shot examples are best understood not as data the model trains on, but as *keys that unlock or fail to unlock* capabilities already inside it. Good examples are the ones that activate the right latent skill or expose a usable principle; noisy ones are those that fight the model's priors, sit as disconnected fragments where coherent trajectories were needed, or try to supply knowledge that was never there to begin with.

Sources 10 notes

Can optimal experimental design improve few-shot example selection?

AIPD frames demonstration selection as budgeted active learning, choosing examples that maximally reduce test-set uncertainty. Two algorithms (GO and SAL) outperformed similarity-based methods across small, medium, and large language models.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

Does learning from mistakes improve in-context learning?

LEAP demonstrates that models achieve better performance on reasoning and math tasks by intentionally erring on few-shot examples, reflecting on mistakes, and deriving explicit task-specific principles—without additional labeled data or fine-tuning.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
