Tags: Language Understanding and Pragmatics · Design & LLM Interaction · Psychology and Social Cognition

Can LLMs generate more novel ideas than human experts?

Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?

Note · 2026-02-21 · sourced from Discourses

Two findings appear to conflict: LLM-generated research ideas are rated more novel than expert-generated ones (Si et al. 2024), yet LLM academic writing systematically avoids the evaluative and evidential nouns that characterize expert intellectual work — it prefers manner nouns (describing process) over status nouns (assessing claims) and evidential nouns (grounding in evidence). How can a system that avoids evaluative stance-taking produce ideas judged more novel than those of evaluating experts?

The resolution: generation and evaluation are dissociated cognitive operations, and LLMs are asymmetrically capable at them.

Generation: Combining existing concepts in new configurations. LLMs have combinatorial range that exceeds human disciplinary range — they are not anchored by domain priors or professional reputation costs. A human expert generates ideas constrained by what is tractable, publishable, and consistent with their existing commitments. LLMs face none of these constraints. The result is wider combinatorial reach, which produces higher novelty scores.

Evaluation: Assessing whether a generated idea is correct, feasible, important, or properly evidenced. This requires epistemic commitment — making a judgment call and defending it. As argued in Should we call LLM errors hallucinations or fabrications?, LLMs have no internal corrective mechanism: they cannot distinguish their accurate claims from their inaccurate ones using the same generative process, and evaluative stance-taking requires exactly that distinction.

The dissociation explains the feasibility gap: LLM ideas are more novel and less feasible. Novelty comes from unconstrained combinatorics; infeasibility comes from the absence of evaluation that would filter out the implausible combinations. Human experts generate fewer novel ideas because they self-evaluate more aggressively during generation.

This has implications for how to use LLMs in research workflows: they are combinatorial idea generators, not evaluators. The appropriate workflow pairs LLM generation with human evaluation, not LLM evaluation of LLM ideas.

Human approval becomes the structural bottleneck. The asymmetry has a workflow consequence beyond appropriate pairing: as AI-generated volume increases, evaluation, which remains on the human side, becomes the capacity-limiting step. AI generates faster than humans can evaluate. As What collaboration level do workers actually want with AI? shows, the desired partnership shape aligns with this constraint: humans do not want to be sidelined, but they also cannot keep up if their role is saturated with approval work. The bottleneck shifts from production (where AI excels) to validation (which AI cannot do for itself), so the human reviewer's cognitive load scales with AI throughput. AI-augmented workflows that ignore this bottleneck accumulate volume faster than validation, and unvalidated output becomes the default rather than the exception.
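A minimal arithmetic sketch of the bottleneck, with hypothetical throughput numbers (none of these rates come from the cited studies): when generation outpaces review at constant rates, the unvalidated backlog grows linearly with time.

```python
def backlog(gen_per_day: float, review_per_day: float, days: int) -> float:
    """Unvalidated items after `days`, assuming constant rates and no triage.

    Illustrative model only: real review rates vary with idea complexity
    and reviewer fatigue, which would make the backlog grow faster.
    """
    return max(0.0, (gen_per_day - review_per_day) * days)

# A reviewer who validates 5 ideas/day against a generator producing 50/day
# falls 45 ideas further behind every day.
print(backlog(gen_per_day=50, review_per_day=5, days=10))   # 450.0

# Only when review capacity meets or exceeds generation does the backlog stay at zero.
print(backlog(gen_per_day=50, review_per_day=60, days=10))  # 0.0
```

The point of the toy model is that the steady-state outcome depends only on the sign of the rate difference: any workflow where generation persistently exceeds review capacity makes unvalidated output the default.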

Empirical closure: the ideation-execution gap. A large-scale execution study (N=43 experts, 100+ hours each, The Ideation-Execution Gap) provides direct evidence: when LLM-generated and human ideas are randomly assigned to expert implementers, "the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p<0.05), closing the gap between LLM and human ideas observed at the ideation stage." Execution reveals weaknesses invisible at ideation — missing baselines, impractical evaluation methods, poor generalizability. LLM ideas systematically propose evaluations requiring human expert recruitment that executors always change. See Do LLM research ideas actually hold up when experts try to execute them?.

The literary criticism case: Literary criticism is the domain where the ideation-evaluation dissociation is most consequential, because criticism requires both operations simultaneously. A critic must identify what a text does (the generative/recognition side — which devices are present, what patterns emerge) AND judge whether it succeeds (the evaluative side — does this metaphor work, does this structure serve the argument, is this ambiguity productive or merely confusing). LLMs can perform the first operation impressively — detecting rhetorical devices, extracting metaphoric mappings, identifying stylistic signatures. They cannot perform the second. As Can LLMs truly understand literary meaning or just mechanics? argues, literary analysis is where the dissociation stops being an interesting theoretical observation and becomes a functional barrier.


Source: Discourses; enriched from inbox/research-brief-llm-literary-analysis-2026-03-02.md


llm ideation and evaluation are dissociated — combinatorial generation can exceed human novelty while evaluative stance-taking remains structurally absent