Why do LLM-generated ideas score higher novelty yet lower feasibility than expert ideas?

This explores why LLM-generated research ideas reliably win on novelty but lose on feasibility — and what that split tells us about how these models actually 'think up' ideas.

The corpus suggests the answer is not a single weakness but a structural split in how generation and judgment work. The cleanest framing is that ideation and evaluation are dissociated capabilities (Can LLMs generate more novel ideas than human experts?). An LLM generates novelty precisely *because* it isn't carrying an expert's accumulated sense of what won't work. Expert knowledge constrains the search space: you don't propose the wild combination because you already know why it fails, so the same disciplinary judgment that produces feasibility also suppresses novelty (Do language models generate more novel research ideas than experts?). The model, free of those guardrails, roams wider and lands on combinations an expert would have pruned. That's the source of the novelty, and the same absence is the source of the infeasibility.

The gap doesn't show up at the moment of ideation; it shows up at execution. When 43 expert researchers implemented randomly assigned ideas over 100+ hours, the LLM-generated ideas declined significantly more than the human ideas on every metric, revealing impractical evaluation designs and missing technical groundwork that were invisible on paper (Do LLM research ideas actually hold up when experts try to execute them?). So 'feasibility' isn't really a property the model failed to optimize. It's a property that can only be discovered by trying, and the model never tries. Worse, the model can't catch its own weak ideas: automated evaluation overestimates quality by around 60%, because LLMs systematically avoid the evaluative, stance-taking work that feasibility judgment requires (Why do LLMs generate more novel research ideas than experts?).

There's a deeper reason the model can't self-assess feasibility: feasibility judgment is partly *social*. Whether an idea is workable depends on reputation, track record, and standing, the social world where expertise is built, and the model only ever sees text, not that world (Can language models distinguish expert arguments from common assumptions?). An expert reading an idea implicitly asks 'who would have to be wrong for this to work, and have they been?' The model has no access to that ledger.

The corpus adds a twist: the novelty is shallower than the headline suggests. LLM ideas are individually novel but collectively narrow. They cluster in a few generative regions rather than spreading across the conceptual map, a kind of diversity collapse (Why do LLMs generate novel ideas from narrow ranges?). One proposed explanation is that genuine creativity comes in distinct modes (combinational, exploratory, transformational) and current models really only do the combinational kind, recombining known pieces rather than restructuring the space (Can LLMs reason creatively beyond conventional problem-solving?). That fits a picture of generation as smooth probabilistic flow toward the training distribution rather than turbulent exploration of competing possibilities (Does LLM generation explore competing claims while producing text?).
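One way to make "individually novel but collectively narrow" concrete is to keep two measurements apart: how far each idea sits from prior work, and how far the ideas sit from each other. The sketch below is a toy illustration only, not the metric used in any of the cited notes. The 2-D vectors are made-up stand-ins for idea embeddings, and the function names (`mean_novelty`, `mean_pairwise_diversity`) are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_novelty(ideas, prior_work):
    """Average distance from each idea to its nearest prior-work vector."""
    nearest = [min(cosine_distance(idea, p) for p in prior_work) for idea in ideas]
    return sum(nearest) / len(nearest)


def mean_pairwise_diversity(ideas):
    """Average distance between every pair of ideas in the set."""
    pairs = list(combinations(ideas, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Made-up data (assumption): one prior-work direction and two idea sets.
prior_work = [(1.0, 0.0)]
clustered = [(0.0, 1.0), (0.1, 1.0), (-0.1, 1.0)]  # far from prior work, close together
spread = [(0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]    # far from prior work, far apart

for name, ideas in (("clustered", clustered), ("spread", spread)):
    print(name, round(mean_novelty(ideas, prior_work), 3),
          round(mean_pairwise_diversity(ideas), 3))
```

In this toy setup both sets look novel when each idea is scored alone, but only the spread set covers much of the space. That is the pattern a per-idea novelty rating cannot see.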

Two countercurrents keep this from being a simple 'humans win' story. In some settings the result flips entirely: LLMs produce *more* feasible, more useful conceptual designs but *less* novel ones, suggesting the novelty/feasibility tradeoff depends heavily on task and prompting rather than being fixed (Why do LLMs excel at feasible design but struggle with novelty?). And the evaluation gap may be partly fixable: a structured pipeline that decomposes the judgment (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines (Can structured pipelines make LLM novelty assessment reliable?). The dissociation between dreaming up ideas and vetting them, in other words, may be less an inherent limit than a missing piece of scaffolding.
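The decomposition mentioned above (extract claims, retrieve related work, compare) can be sketched as three small functions whose outputs feed one another. This is a minimal sketch under heavy assumptions: the notes do not say how each stage is implemented, so sentence splitting, Jaccard word overlap, and the 0.5 threshold are placeholders chosen for illustration, and every function name here is hypothetical. The point is the shape of the pipeline, where each stage produces an inspectable intermediate result, rather than the quality of any single stage.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_claims(submission):
    """Stage 1 (stand-in): treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", submission.strip())
    return [p for p in parts if p]


def retrieve_related(claim, corpus, top_k=1):
    """Stage 2 (stand-in): rank prior-work snippets by Jaccard word overlap."""
    claim_tokens = tokenize(claim)
    scored = []
    for doc in corpus:
        doc_tokens = tokenize(doc)
        union = claim_tokens | doc_tokens
        score = len(claim_tokens & doc_tokens) / len(union) if union else 0.0
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def compare(claim, related, threshold=0.5):
    """Stage 3 (stand-in): a claim looks novel if nothing retrieved overlaps much."""
    best_score = related[0][0] if related else 0.0
    return {"claim": claim, "best_overlap": best_score,
            "looks_novel": best_score < threshold}


def assess_novelty(submission, corpus):
    """Run the three stages claim by claim instead of judging holistically."""
    return [compare(c, retrieve_related(c, corpus)) for c in extract_claims(submission)]


# Made-up example data (assumption), not drawn from the cited study.
corpus = [
    "We fine-tune a transformer on code review comments.",
    "Graph neural networks predict molecular properties.",
]
submission = ("We fine-tune a transformer on code review comments. "
              "We use reviewer disagreement as a training signal.")
report = assess_novelty(submission, corpus)
for row in report:
    print(row["looks_novel"], round(row["best_overlap"], 2), row["claim"])
```

Because each claim is checked against retrieved evidence separately, a reviewer can see which claim was matched to which prior work, which a single holistic verdict does not expose.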

Sources 10 notes

Can LLMs generate more novel ideas than human experts?

LLMs produce more novel research ideas than experts because they lack disciplinary constraints, but they systematically avoid evaluative stance-taking required to assess feasibility or validity. Generation and evaluation are dissociated capabilities.

Do language models generate more novel research ideas than experts?

A study of 100+ NLP researchers found LLM-generated ideas rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Do LLM research ideas actually hold up when experts try to execute them?

When 43 expert researchers implemented randomly assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.

Why do LLMs generate more novel research ideas than experts?

Research shows LLM-generated ideas are statistically more novel than expert-produced ideas, but LLMs struggle to evaluate quality: automated evaluation overestimates it by about 60%. When executed, LLM ideas drop significantly on all metrics, suggesting novelty without feasibility.

Can language models distinguish expert arguments from common assumptions?

LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.

Why do LLMs generate novel ideas from narrow ranges?

LLM-generated research ideas are rated individually novel but lack diversity, clustering in narrow generative regions. Combined with LLM self-evaluation failures, this limits the possibility space explored compared to human ideation across different conceptual territories.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Why do LLMs excel at feasible design but struggle with novelty?

Expert evaluation shows LLM-generated conceptual designs score higher on feasibility and usefulness but lower on novelty compared to crowdsourced human solutions. Few-shot learning further reduces diversity while improving quality alignment.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.
