Can LLMs generate more novel ideas than human experts?
Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?
Two findings appear to conflict: LLM-generated research ideas are rated more novel than expert-generated ones (Si et al. 2024), yet LLM academic writing systematically avoids the evaluative and evidential nouns that characterize expert intellectual work — it prefers manner nouns (describing process) over status nouns (assessing claims) and evidential nouns (grounding in evidence). How can a system that avoids evaluative stance-taking produce ideas judged more novel than those of evaluating experts?
The resolution: generation and evaluation are dissociated cognitive operations, and LLM capability is asymmetric across them: strong at generation, weak at evaluation.
Generation: Combining existing concepts in new configurations. LLMs have combinatorial range that exceeds human disciplinary range — they are not anchored by domain priors or professional reputation costs. A human expert generates ideas constrained by what is tractable, publishable, and consistent with their existing commitments. LLMs face none of these constraints. The result is wider combinatorial reach, which produces higher novelty scores.
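To see why unconstrained combinatorics raises novelty scores, a back-of-envelope count helps. The sketch below is a toy model with invented pool sizes (50 concepts per field, 20 fields); the numbers are assumptions for illustration, and only the quadratic scaling matters:

```python
from math import comb

# Toy numbers (assumptions for illustration, not measurements).
n_per_domain = 50   # concepts an expert actively works with in one field
n_domains = 20      # fields an unanchored generator can draw on

# An expert anchored to one domain pairs concepts within that domain.
expert_pairs = comb(n_per_domain, 2)

# A generator without disciplinary anchoring pairs across every domain.
llm_pairs = comb(n_per_domain * n_domains, 2)

print(f"within-domain pairs: {expert_pairs:,}")                 # 1,225
print(f"cross-domain pairs:  {llm_pairs:,}")                    # 499,500
print(f"reach ratio:         {llm_pairs / expert_pairs:.0f}x")  # ~408x
```

The ratio is the point: restricting generation to a single discipline removes over 99% of the candidate pairs before any evaluation has happened.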
Evaluation: Assessing whether a generated idea is correct, feasible, important, or properly evidenced. This requires epistemic commitment: making a judgment call and defending it. As Should we call LLM errors hallucinations or fabrications? argues, LLMs have no internal corrective mechanism; they cannot distinguish their accurate claims from their inaccurate ones using the same generative process. Evaluative stance-taking requires exactly this distinction.
The dissociation explains the feasibility gap: LLM ideas are more novel and less feasible. Novelty comes from unconstrained combinatorics; infeasibility comes from the absence of evaluation that would filter out the implausible combinations. Human experts generate fewer novel ideas because they self-evaluate more aggressively during generation.
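This filtering account can be made concrete with a toy simulation. In the sketch below, novelty and feasibility are invented scores that anticorrelate by construction, and the 0.6 acceptance threshold is arbitrary; it illustrates the mechanism, not the cited data:

```python
import random

random.seed(0)

# Toy model (assumption): each candidate idea has a novelty and a
# feasibility score, and the two trade off against each other.
def sample_idea():
    novelty = random.random()
    feasibility = 1 - 0.7 * novelty + random.gauss(0, 0.1)  # anticorrelated
    return novelty, feasibility

ideas = [sample_idea() for _ in range(10_000)]

# The "expert" self-evaluates during generation: implausible combinations
# never leave the head. The "LLM" emits everything it generates.
expert_output = [(n, f) for n, f in ideas if f > 0.6]
llm_output = ideas

def mean(xs):
    return sum(xs) / len(xs)

print(f"expert: novelty {mean([n for n, _ in expert_output]):.2f}, "
      f"feasibility {mean([f for _, f in expert_output]):.2f}")
print(f"llm:    novelty {mean([n for n, _ in llm_output]):.2f}, "
      f"feasibility {mean([f for _, f in llm_output]):.2f}")
```

Both streams come from the same generator; the score gap is produced entirely by the filter, which is the dissociation in miniature.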
This has implications for how to use LLMs in research workflows: they are combinatorial idea generators, not evaluators. The appropriate workflow pairs LLM generation with human evaluation, not LLM evaluation of LLM ideas.
Human approval becomes the structural bottleneck. The asymmetry has a workflow consequence beyond appropriate pairing: as AI-generated volume increases, evaluation, which remains on the human side, becomes the capacity-limiting step. AI generates faster than humans can evaluate. As What collaboration level do workers actually want with AI? finds, the desired partnership shape aligns with this constraint: humans do not want to be sidelined, but they also cannot keep up if their role is saturated with approval work. The bottleneck shifts from production (where AI excels) to validation (which AI cannot do for itself), and the ergonomic consequence is that the human reviewer's cognitive load scales with AI throughput. Designing AI-augmented workflows that ignore this bottleneck produces a pipeline where volume accumulates faster than validation, and unvalidated output becomes the default rather than the exception.
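The arithmetic behind the bottleneck is plain queueing. The rates below are assumptions picked for illustration (40 ideas generated per day, 8 vetted per reviewer-day), not measured figures:

```python
# Toy throughput model (rates are assumptions chosen to show the shape).
generation_rate = 40   # candidate ideas an AI pipeline emits per day
review_rate = 8        # ideas one expert reviewer can seriously vet per day

backlog = 0
for day in range(1, 11):
    backlog += generation_rate - review_rate   # unvalidated output piles up
    print(f"day {day:2d}: unvalidated backlog = {backlog}")

# Whenever generation_rate > review_rate, the backlog grows linearly:
# backlog(t) = (generation_rate - review_rate) * t. Scaling generation
# makes it worse; only added evaluation capacity helps.
```

The linear growth term is the whole argument: reviewer load scales with generator throughput, and no setting of the generation rate fixes it.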
Empirical closure: the ideation-execution gap. A large-scale execution study (N=43 experts, 100+ hours each; The Ideation-Execution Gap) provides direct evidence: when LLM-generated and human ideas are randomly assigned to expert implementers, "the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p<0.05), closing the gap between LLM and human ideas observed at the ideation stage." Execution reveals weaknesses invisible at ideation: missing baselines, impractical evaluation methods, poor generalizability. LLM ideas systematically propose evaluations that require recruiting human experts, a design choice the executors invariably change. See Do LLM research ideas actually hold up when experts try to execute them?.
The literary criticism case: Literary criticism is the domain where the ideation-evaluation dissociation is most consequential, because criticism requires both operations simultaneously. A critic must identify what a text does (the generative/recognition side — which devices are present, what patterns emerge) AND judge whether it succeeds (the evaluative side — does this metaphor work, does this structure serve the argument, is this ambiguity productive or merely confusing). LLMs can perform the first operation impressively — detecting rhetorical devices, extracting metaphoric mappings, identifying stylistic signatures. They cannot perform the second. As Can LLMs truly understand literary meaning or just mechanics? argues, literary analysis is where the dissociation stops being an interesting theoretical observation and becomes a functional barrier.
Source: Discourses; enriched from inbox/research-brief-llm-literary-analysis-2026-03-02.md
Related concepts in this collection
- Do language models generate more novel research ideas than experts? Explores whether LLMs can break free from expert constraints to generate more novel research concepts. Matters because novelty is often thought to be AI's creative blind spot. Relation: the generation-side finding; this note explains why novelty without feasibility is the expected outcome.
- Why do ChatGPT essays lack evaluative depth despite grammatical strength? ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap—favoring description over argument—reflects a fundamental limitation in how LLMs approach academic writing. Relation: the evaluation-side finding; structurally coherent but evaluatively absent.
- Should we call LLM errors hallucinations or fabrications? Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening. Relation: grounds the dissociation; no internal corrective mechanism means evaluation and generation are not coupled, and what is generated is not assessed before output.
- Can imitating ChatGPT fool evaluators into thinking models improved? Explores whether fine-tuning weaker models on ChatGPT outputs creates an illusion of capability gains. Investigates why human raters and automated judges fail to detect that imitation improves style but not underlying factuality or reasoning. Relation: a practical consequence of the dissociation; imitation models capture the generative style (combinatorial fluency) while missing factual grounding (evaluative accuracy), because imitation training optimizes the generation side that LLMs are already good at.
- Does chatbot interaction trade authenticity for better problem-solving? When students solve problems with AI chatbots instead of peers, do they sacrifice personal voice and subjective expression in exchange for more efficient knowledge exchange and higher task performance? Relation: the dissociation manifests in educational settings; chatbots provide efficient knowledge generation (the combinatorial side), but the absence of evaluative stance-taking means students stop articulating and defending their own positions, mirroring the generation-without-evaluation pattern.
- Can LLMs reason creatively beyond conventional problem-solving? Explores whether large language models can engage in truly creative reasoning that expands or redefines solution spaces, rather than just decomposing known problems. This matters because existing reasoning methods may miss creative capabilities entirely. Relation: UoT's three-axis evaluation (feasibility + utility + novelty) directly addresses the evaluation gap; the dissociation means LLMs can generate across all three creative paradigms but cannot assess which outputs are feasible or useful without an external evaluative framework.
- Why do LLMs generate novel ideas from narrow ranges? LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation. Relation: a practical manifestation of the evaluation dissociation; models cannot assess that they are repeatedly sampling from the same high-novelty cluster, and self-evaluation failures prevent recognizing when diversity has collapsed, making the generation-without-evaluation pattern visible at the population level.
- Why do LLMs excel at feasible design but struggle with novelty? When LLMs generate conceptual product designs, they produce more implementable and useful solutions than humans but fewer novel ones. This explores why domain constraints flip the novelty advantage seen in research ideation. Relation: domain inversion; in constrained design domains where evaluation criteria are embedded in the prompt (feasibility, usefulness ratings), models channel generation toward conservative solutions, so the dissociation flips: evaluation constraints suppress novelty rather than being absent.
Original note title
llm ideation and evaluation are dissociated — combinatorial generation can exceed human novelty while evaluative stance-taking remains structurally absent