Language Understanding and Pragmatics · Design & LLM Interaction · LLM Reasoning and Architecture

Why do LLMs generate novel ideas from narrow ranges?

LLM research agents produce individually novel ideas but cluster them into homogeneous sets. This note explores why high average novelty coexists with poor coverage of the idea space, and what that means for automated ideation.

Note · 2026-02-21 · sourced from Discourses
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The LLM research ideation study identifies diversity collapse as a primary failure mode for LLM research agents, distinct from the average novelty finding. Individual LLM-generated ideas may be rated as novel by human reviewers, but the set of ideas generated lacks diversity — they cluster around a narrow generative range.

This is a familiar pattern from other LLM generation tasks: the model finds high-probability regions of the output space that satisfy the novelty criteria locally, then repeatedly samples from those regions. High average quality does not guarantee diverse coverage.

For research ideation specifically, diversity collapse is a practical problem: the point of idea generation is to explore the possibility space, not to generate multiple instances of the same high-novelty cluster. Ten variations on the same structural idea are less valuable than ten ideas from different conceptual territories, even if the former batch is individually more novel.
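The novelty-vs-diversity distinction can be made concrete by scoring the *set* rather than each idea: embed every idea and take the mean pairwise cosine distance. A minimal sketch, using toy 2-D vectors in place of real idea embeddings (the vectors and the `set_diversity` helper are illustrative, not from the study):

```python
from itertools import combinations
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def set_diversity(embeddings):
    """Mean pairwise cosine distance: higher = more diverse set."""
    pairs = list(combinations(embeddings, 2))
    return sum(1 - cosine(a, b) for a, b in pairs) / len(pairs)

# Toy embeddings (hypothetical values).
clustered = [[1.0, 0.0], [0.99, 0.14], [0.98, 0.2]]   # near-duplicate ideas
spread    = [[1.0, 0.0], [0.0, 1.0], [0.7, -0.7]]     # distinct directions

# The clustered batch can score high on per-idea novelty while covering
# almost no territory; the diversity metric exposes the difference.
assert set_diversity(clustered) < set_diversity(spread)
```

Per-idea novelty scores and a set-level diversity score like this one are independent axes; the study's finding is that LLM agents do well on the first while collapsing on the second.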

The study also identifies a second failure mode: LLM self-evaluation failures. Models cannot accurately assess the quality of their own generated ideas. This means automated pipelines that use LLM self-scoring as a quality filter will misestimate which ideas are worth pursuing — the model's own judgment of its outputs is unreliable.
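The self-evaluation failure is itself measurable: compare the model's self-scores against external (e.g. human reviewer) scores for the same ideas using a rank correlation. A minimal sketch, assuming tie-free scores and using hypothetical numbers (the study does not publish these values):

```python
def rank(xs):
    """Rank positions (0 = smallest); assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for pos, i in enumerate(order):
        ranks[i] = pos
    return ranks

def spearman(a, b):
    """Spearman rank correlation for tie-free lists."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for the same five ideas.
self_scores  = [9, 8, 7, 6, 5]   # model rating its own outputs
human_scores = [3, 7, 2, 8, 4]   # external reviewers disagree

# Near-zero or negative correlation means self-scoring is an
# unreliable quality filter for an automated pipeline.
print(spearman(self_scores, human_scores))  # ≈ -0.3 with these numbers
```

A pipeline that ranks by `self_scores` and keeps the top ideas would, under a correlation like this, routinely discard the ideas reviewers rate highest.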

The combination is particularly damaging: diversity collapse means the search space is poorly covered, and self-evaluation failures mean the model cannot compensate by identifying which of its narrow outputs are the most promising.

LLM creativity may have peaked. "Has the Creativity of Large-Language Models Peaked?" tests inter- and intra-LLM variability on the Divergent Association Task (DAT) and Alternative Uses Task (AUT). GPT-4o performed substantially worse on the DAT than GPT-4 did when benchmarked in 2023, suggesting regression rather than progress. Even on the AUT, only 0.28% of LLM responses reached the 90th percentile of human creativity; since 10% of human responses exceed that threshold by definition, humans are roughly 35.7x (10 / 0.28) more likely to produce standout ideas.

LLMs generate mid-level novelty reliably but rarely produce radical or conceptual creativity, reinforcing the view that their creativity is combinatorial rather than transformative. Prompt design emerged as a significant modulator: disclosing the creative-test context improved some models while worsening others, suggesting creativity in LLMs is partly prompt-contingent rather than an inherent capacity.

The Catfish Agent paper (multi-agent clinical reasoning) provides a mechanism, answering: why do multi-agent LLM systems converge without real debate? In multi-agent systems, over 61% of iterations converge through social accommodation rather than reasoning. The same dynamics that produce diversity collapse in single-model ideation operate even more powerfully in multi-agent contexts — agents accommodate each other's initial frames, preventing the genuine disagreement that would drive coverage of different conceptual territory. The pattern holds across creative ideation (individual LLM), clinical reasoning (multi-agent LLM), and RL training dynamics (Does policy entropy collapse limit reasoning performance in RL?).
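The RL-side analogue can be stated precisely: policy entropy is the Shannon entropy of the action distribution, and "collapse" means it falls toward zero as probability mass concentrates on one mode — the distributional counterpart of agents converging on a single frame. A minimal sketch with hypothetical distributions (not values from any of the cited papers):

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

early = [0.25, 0.25, 0.25, 0.25]  # hypothetical: policy still exploring
late  = [0.97, 0.01, 0.01, 0.01]  # hypothetical: collapsed onto one mode

# Uniform over 4 actions gives log(4) ≈ 1.386 nats; the collapsed
# policy retains only a fraction of that.
print(entropy(early), entropy(late))
```

Tracking this quantity over training is how entropy collapse is diagnosed; once it nears zero, the policy has stopped sampling the alternatives that would cover new territory — the same loss of coverage the ideation and multi-agent settings exhibit.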



Original note title: llm research ideation suffers from diversity collapse despite high average novelty