Can few-shot examples narrow generative diversity in creative tasks?

This explores whether feeding an LLM a few worked examples (few-shot prompting, or in-context learning) quietly narrows the range of what it generates on creative tasks. The honest starting point: the collection has rich material on *why* generative diversity collapses, but only thin coverage of few-shot examples as the specific lever. The strongest answer therefore comes from triangulating the surrounding mechanisms rather than from one paper that names the effect.

The most useful surprise is that 'examples narrow diversity' isn't a clean yes. Diversity effects flip by domain. Preference tuning *reduces* lexical and syntactic variety in code generation but *increases* it in creative writing, because code rewards converging on one correct answer while creative writing rewards standing out (see "Does preference tuning always reduce diversity the same way?"). By that logic, a few-shot example that would tightly anchor a coding task might do something different in a creative one: examples there can demonstrate that distinctiveness is the goal rather than fence the model into a single mold. This is an inference from the preference-tuning result, not something the corpus tests on few-shot prompts directly. So the answer depends on what your examples implicitly signal the task is *for*.

The deeper risk the corpus does document is that models converge on their own, even without your examples. Across 70+ models and 26K open-ended queries, different LLMs independently produce strikingly similar outputs, an 'Artificial Hivemind' driven by overlapping training data and shared alignment procedures (see "Do different AI models actually produce diverse outputs?"). Larger models also concentrate probability mass on their preferred outputs, so within a fixed sampling budget they yield fewer distinct samples than much smaller ones (see "Why aren't bigger models better for generating diverse outputs?"). Few-shot examples would plausibly *compound* this baseline pull toward the mode: you're handing the model an anchor on top of a distribution already biased toward its favorite answer.
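The link between concentrated probability mass and fewer distinct samples can be made concrete with a standard probability identity: drawing n independent samples from a distribution p yields, in expectation, the sum over outcomes of 1 - (1 - p_i)^n distinct outcomes. The sketch below is only an illustration of that arithmetic. The two toy distributions (`flat` and `peaked`) are assumptions invented for the example and are not taken from any of the cited notes.

```python
def expected_distinct(probs, n_draws):
    """Expected number of distinct outcomes seen in n_draws independent samples.

    Each outcome i is missed by all draws with probability (1 - p_i) ** n_draws,
    so it is seen at least once with probability 1 - (1 - p_i) ** n_draws.
    """
    return sum(1.0 - (1.0 - p) ** n_draws for p in probs)


# Toy distributions over 10 possible outputs (hypothetical, for illustration).
flat = [0.1] * 10                   # mass spread evenly
peaked = [0.9] + [0.1 / 9] * 9      # most mass on one favorite output

if __name__ == "__main__":
    budget = 10  # fixed sampling budget
    print("flat:  ", round(expected_distinct(flat, budget), 2))
    print("peaked:", round(expected_distinct(peaked, budget), 2))
```

Under these assumed numbers the peaked distribution yields far fewer distinct outputs for the same budget, which is the shape of the effect the note describes. It says nothing about how peaked any real model actually is.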

The closest the corpus comes to few-shot prompting directly is work on ordering in-context demonstrations: sequencing them from harder to easier improves performance (see "Can representation sparsity order few-shot demonstrations effectively?"). Notably, that work is framed around accuracy, not diversity, which mirrors a field-wide blind spot. Most reasoning and ideation methods optimize for getting the conventional answer right and ignore the distinct creative modes (combinational, exploratory, transformational) where diversity actually lives, a gap the corpus argues may itself explain ideation collapse (see "Can LLMs reason creatively beyond conventional problem-solving?").

What you didn't know you wanted to know: the same narrowing shows up under many names, including entropy collapse in RL search agents (see "Does reinforcement learning squeeze exploration diversity in search agents?") and format collapse, where RL amplifies one pretraining format within the first epoch (see "Does RL training collapse format diversity in pretrained models?"). The documented *fixes* point at what an anti-collapse few-shot strategy should look like. Diversity is preserved by training on varied demonstrations rather than a narrow set (see "Does reinforcement learning squeeze exploration diversity in search agents?"), by step-level critique that counteracts 'tail narrowing' before it sets in (see "Do critique models improve diversity during training itself?"), and by deliberately layering variation (persona, subtopic, context) so the examples themselves carry breadth instead of collapsing it (see "Can synthetic dialogues become realistic through layered diversity?"). The lesson: a few homogeneous examples will likely narrow you; a deliberately heterogeneous set is the same tool pointed the other way.
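One way to act on the "deliberately heterogeneous set" idea is to choose few-shot examples so that they are as unlike each other as possible. The sketch below is a minimal illustration, not a method from any of the cited notes: it uses greedy farthest-point selection with Jaccard distance over word sets as a crude stand-in for dissimilarity. The function names (`jaccard_distance`, `pick_heterogeneous`, `mean_pairwise_distance`) and the sample candidates are hypothetical, and a real pipeline would likely swap in a better similarity measure.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over lowercase word sets (a crude lexical proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def pick_heterogeneous(candidates, k):
    """Greedily pick up to k examples, each maximizing its minimum distance
    to the examples already chosen. Starts from the first candidate."""
    if k <= 0 or not candidates:
        return []
    chosen = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


def mean_pairwise_distance(examples):
    """Average Jaccard distance over all pairs; 0.0 if fewer than two examples."""
    pairs = list(combinations(examples, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical candidate pool: three near-duplicates and one outlier.
candidates = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "quantum storms over a neon harbor",
    "the cat sat on the sofa",
]

if __name__ == "__main__":
    picked = pick_heterogeneous(candidates, 2)
    print(picked, mean_pairwise_distance(picked))
```

Comparing `mean_pairwise_distance` for a hand-picked set against the greedy pick gives a rough check on whether a prompt's examples are homogeneous. Word overlap only captures lexical variety, so it would miss the persona and context layers the note mentions.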

Sources (9 notes)

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can synthetic dialogues become realistic through layered diversity?

Research shows that realistic synthetic dialogues require three multiplicative layers: subtopic specificity, Big Five persona variation, and 11 contextual characteristics via Chain of Thought reasoning. This structured approach captures 90.48% of in-domain dialogue performance.
