Do LLM research ideas actually hold up when experts try to execute them?
Explores whether LLM-generated ideas maintain their apparent novelty advantage when expert researchers spend 100+ hours implementing them. Matters because ideation-stage evaluation may not capture real-world feasibility barriers.
The ideation novelty finding (Si et al. 2025) showed that LLM-generated research ideas were rated significantly more novel than ideas written by human experts. This execution study provides the empirical reality check: when 43 expert researchers each spend over 100 hours implementing randomly assigned ideas and writing 4-page papers, the novelty advantage disappears.
Comparing review scores before and after execution, "the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p<0.05), closing the gap between LLM and human ideas observed at the ideation stage." For many metrics, there is a ranking flip where human ideas score higher after execution.
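The core comparison is a difference-in-differences on review scores: each idea gets a score at ideation and again after execution, and the question is whether LLM ideas' scores drop more than human ideas' scores. A minimal sketch of that logic, using invented scores and a simple permutation test rather than the paper's actual data or analysis code:

```python
# Hedged sketch, NOT the study's analysis: compare how much review scores
# drop from ideation to execution for LLM-generated vs. human-written ideas,
# via a permutation test on per-idea score deltas (execution minus ideation).
import random
from statistics import mean

# Hypothetical 1-10 novelty scores: (ideation, execution) pairs per idea.
llm_ideas   = [(8, 5), (7, 4), (9, 6), (8, 5), (7, 5)]
human_ideas = [(6, 6), (7, 6), (6, 6), (7, 7), (6, 6)]

llm_deltas   = [post - pre for pre, post in llm_ideas]
human_deltas = [post - pre for pre, post in human_ideas]

def permutation_p(a, b, trials=10_000, seed=0):
    """One-sided p-value for: group a's mean delta is lower than group b's."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)  # more negative => a dropped more
    pooled = a + b
    count = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        # Randomly relabel the deltas and recompute the group difference.
        if mean(pooled[:len(a)]) - mean(pooled[len(a):]) <= observed:
            count += 1
    return count / trials

p = permutation_p(llm_deltas, human_deltas)
print(f"mean drop: LLM {mean(llm_deltas):.2f}, human {mean(human_deltas):.2f}, p={p:.4f}")
```

With the invented scores above, the LLM ideas lose far more ground than the human ideas, mirroring the paper's p<0.05 result; the actual study uses real reviewer scores across four metrics, not this toy setup.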
The mechanism is specific: execution imposes feasibility constraints that ideation-stage evaluation cannot anticipate. "During execution, every single step has to be grounded in realistic execution constraints, which impose higher feasibility standards than the ideation stage." Reviewers discover weaknesses visible only through implementation: missing baselines, poor generalizability, impractical evaluation designs, high resource requirements. For example, AI-generated ideas systematically propose evaluations that require recruiting human experts, a design choice executors consistently change to save cost and time.
This resolves the tension between "Can LLMs generate more novel ideas than human experts?" and "Why do LLMs excel at feasible design but struggle with novelty?". The ideation-evaluation dissociation is itself the problem: LLMs generate novel-sounding ideas precisely because they lack the evaluative capacity to recognize execution barriers. Novelty at ideation is a property of description quality, not executability. Combined with "Why do LLMs generate novel ideas from narrow ranges?", individual LLM ideas may be novel and at the same time individually infeasible; the two findings compound rather than contradict.
The implication for AI-assisted research is that proxy evaluation (judging ideas without execution) systematically overestimates LLM contribution. "Objective metrics like feasibility and effectiveness are best judged via the actual execution outcomes rather than speculative judgment based on the ideas." This challenges any benchmark or evaluation that rates AI research capability without implementation.
Source: Work Application Use Cases
Related concepts in this collection
- "Can LLMs generate more novel ideas than human experts?" Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction? Relation: the ideation-execution gap is the empirical consequence of this dissociation.
- "Why do LLMs generate novel ideas from narrow ranges?" LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation. Relation: the findings compound, yielding ideas that are individually novel, collectively homogeneous, and individually infeasible.
- "Why do LLMs excel at feasible design but struggle with novelty?" When LLMs generate conceptual product designs, they produce more implementable and useful solutions than humans but fewer novel ones. This explores why domain constraints flip the novelty advantage seen in research ideation. Relation: a domain inversion, in which research ideas are novel but not feasible while design solutions are feasible but not novel.
Original note title
LLM-generated research ideas suffer an ideation-execution gap — ideas rated as novel at ideation score significantly lower after expert execution on all metrics