Why do LLMs generate novel ideas from narrow ranges?
LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
The LLM research ideation study identifies diversity collapse as a primary failure mode for LLM research agents, distinct from the average novelty finding. Individual LLM-generated ideas may be rated as novel by human reviewers, but the set of ideas generated lacks diversity — they cluster around a narrow generative range.
This is a familiar pattern from other LLM generation tasks: the model finds high-probability regions of the output space that satisfy the novelty criteria locally, then repeatedly samples from those regions. High average quality does not guarantee diverse coverage.
For research ideation specifically, diversity collapse is a practical problem: the point of idea generation is to explore the possibility space, not to generate multiple instances of the same high-novelty cluster. Ten variations on the same structural idea are less valuable than ten ideas from different conceptual territories, even if the former batch is individually more novel.
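The dissociation between average novelty and coverage can be made concrete with a toy metric: score each idea's novelty as its distance from the nearest prior work, and the set's diversity as the mean pairwise distance within the set. A minimal sketch, where the 2-D points are hypothetical stand-ins for idea embeddings (not any real embedding model):

```python
import math
from itertools import combinations

def mean_novelty(ideas, prior_work):
    """Average distance from each idea to its nearest piece of prior work."""
    return sum(min(math.dist(i, p) for p in prior_work) for i in ideas) / len(ideas)

def diversity(ideas):
    """Mean pairwise distance within the generated set (a coverage proxy)."""
    pairs = list(combinations(ideas, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

prior = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]                  # existing literature
clustered = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]  # one far-away cluster
spread = [(3.0, 0.0), (0.0, 3.0), (-3.0, 0.0), (0.0, -3.0)]   # four directions

# The clustered set wins on average novelty but barely covers any territory;
# the spread set is individually less novel yet far more diverse.
print(mean_novelty(clustered, prior), diversity(clustered))
print(mean_novelty(spread, prior), diversity(spread))
```

On these numbers the clustered set has the higher mean novelty while the spread set has many times its diversity, which is exactly the dissociation the study reports: per-idea novelty ratings cannot detect that the whole batch sits in one region.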
The study also identifies a second failure mode: LLM self-evaluation failures. Models cannot accurately assess the quality of their own generated ideas. This means automated pipelines that use LLM self-scoring as a quality filter will misestimate which ideas are worth pursuing — the model's own judgment of its outputs is unreliable.
The combination is particularly damaging: diversity collapse means the search space is poorly covered, and self-evaluation failures mean the model cannot compensate by identifying which of its narrow outputs are the most promising.
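That combined failure can be sketched in a few lines, with hypothetical scores: when the model's self-ratings are miscalibrated against true quality (here, anti-correlated), a pipeline that filters by self-score discards the strongest idea.

```python
# Hypothetical numbers for illustration only.
ideas = ["A", "B", "C", "D"]
true_quality = {"A": 0.9, "B": 0.3, "C": 0.7, "D": 0.2}  # e.g. later human review
self_score   = {"A": 0.2, "B": 0.8, "C": 0.4, "D": 0.9}  # model's own ratings

picked = max(ideas, key=lambda i: self_score[i])    # what the pipeline promotes
best = max(ideas, key=lambda i: true_quality[i])    # what it should have promoted

print(picked, true_quality[picked])  # D 0.2 -- the filter's choice
print(best, true_quality[best])      # A 0.9 -- discarded by the filter
```

With narrow coverage on top of this, the pipeline is choosing badly among near-duplicates: even a lucky pick recovers only one point in an unexplored space.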
LLM creativity may have peaked. "Has the Creativity of Large-Language Models Peaked?" tests inter- and intra-LLM variability on the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). GPT-4o performed substantially worse on the DAT than GPT-4 did when benchmarked in 2023, suggesting regression rather than progress. Even on the AUT, only 0.28% of LLM responses reached the 90th percentile of human creativity, making humans 35.7x more likely to produce standout ideas. LLMs reliably generate mid-level novelty but rarely produce radical or conceptual creativity, reinforcing the view that their creativity is combinatorial rather than transformative. Prompt design emerged as a significant modulator: disclosing the creative-test context improved some models while worsening others, suggesting that creativity in LLMs is partly prompt-contingent rather than an inherent capacity.
The Catfish Agent paper (multi-agent clinical reasoning) supplies a mechanism (see: Why do multi-agent LLM systems converge without real debate?). In multi-agent systems, more than 61% of iterations converge through social accommodation rather than reasoning. The same dynamics that produce diversity collapse in single-model ideation operate even more powerfully in multi-agent contexts: agents accommodate each other's initial frames, preventing the genuine disagreement that would drive coverage of different conceptual territory. The pattern holds across creative ideation (individual LLM), clinical reasoning (multi-agent LLM), and RL training dynamics (see: Does policy entropy collapse limit reasoning performance in RL?).
Source: Discourses
Related concepts in this collection
- Do language models generate more novel research ideas than experts?
  Explores whether LLMs can break free from expert constraints to generate more novel research concepts. Matters because novelty is often thought to be AI's creative blind spot.
  Relation: the novelty finding that this note complicates.
- Why do LLMs generate more novel research ideas than experts?
  LLM-generated research ideas are statistically more novel than those from 100+ expert researchers, but the mechanisms behind this advantage and its practical implications remain unclear. Understanding this paradox could reshape how we use AI in creative knowledge work.
  Relation: writing angle.
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  Relation: the same mechanism at a different scale; optimization pressure (RL reward, quality preference) narrows LLM output diversity whether at training time or generation time.
- Why do reasoning models fail differently at training versus inference?
  Reasoning models exhibit two distinct failure modes, entropy collapse during training and variance inflation during inference, that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
  Relation: extends this note; research ideation diversity collapse is a third manifestation of the same entropy/diversity collapse pattern across LLM optimization contexts.
- Can LLMs reason creatively beyond conventional problem-solving?
  Explores whether large language models can engage in truly creative reasoning that expands or redefines solution spaces, rather than just decomposing known problems. This matters because existing reasoning methods may miss creative capabilities entirely.
  Relation: diversity collapse may occur because existing methods explore only combinational creativity; explicitly prompting for exploratory and transformational paradigms could expand the generative range beyond the narrow high-novelty cluster.
- Can LLMs generate more novel ideas than human experts?
  Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?
  Relation: the mechanism underlying diversity collapse; an inability to self-evaluate means models cannot recognize when they are repeatedly sampling the same generative region, and this dissociation explains why high individual novelty coexists with collective homogeneity.
- Why do LLMs excel at feasible design but struggle with novelty?
  When LLMs generate conceptual product designs, they produce more implementable and useful solutions than humans but fewer novel ones. This explores why domain constraints flip the novelty advantage seen in research ideation.
  Relation: domain inversion; diversity collapse occurs in both research ideation and conceptual design, but with opposite profiles (in research, high novelty with collapsed diversity; in design, high feasibility with collapsed novelty). The common mechanism is a narrow generative range regardless of which quality dimension is optimized.
Original note title
llm research ideation suffers from diversity collapse despite high average novelty