Design & LLM Interaction

Do LLM research ideas actually hold up when experts try to execute them?

Explores whether LLM-generated ideas maintain their apparent novelty advantage when expert researchers spend 100+ hours implementing them. Matters because ideation-stage evaluation may not capture real-world feasibility barriers.

Note · 2026-03-30 · sourced from Work Application Use Cases
How do you build domain expertise into general AI models?

The ideation novelty finding (Si et al. 2025) showed that LLM-generated research ideas were rated significantly more novel than human expert ideas. This execution study provides the empirical reality check: when 43 expert researchers each spend over 100 hours implementing randomly assigned ideas and writing 4-page papers, the novelty advantage disappears.

Comparing review scores before and after execution, "the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p<0.05), closing the gap between LLM and human ideas observed at the ideation stage." For many metrics, there is a ranking flip where human ideas score higher after execution.
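The before/after comparison can be made concrete with a minimal sketch. The scores below are invented for illustration (the study's actual data and scale are not reproduced here); the only claim encoded is the qualitative pattern reported: LLM-originated ideas lose more of their rating once execution makes feasibility visible.

```python
from statistics import mean

# Hypothetical (ideation_score, post_execution_score) pairs for each idea.
# All numbers are invented, not the study's data.
llm_ideas = [(7, 5), (8, 5), (6, 4), (7, 6), (8, 6)]
human_ideas = [(6, 6), (5, 5), (7, 6), (6, 7), (5, 5)]

def mean_score_drop(pairs):
    """Average decline from ideation score to post-execution score."""
    return mean(before - after for before, after in pairs)

llm_drop = mean_score_drop(llm_ideas)
human_drop = mean_score_drop(human_ideas)

print(f"LLM drop:   {llm_drop:.2f}")    # larger decline after execution
print(f"Human drop: {human_drop:.2f}")  # roughly stable

# The reported pattern: LLM ideas decline significantly more,
# closing (or flipping) the ideation-stage gap.
assert llm_drop > human_drop
```

The study itself tests this per metric (novelty, excitement, effectiveness, overall) with significance at p<0.05; a real analysis would use a paired test over the same-idea score pairs rather than a bare mean difference.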

The mechanism is clear: execution imposes feasibility constraints that ideation evaluation cannot anticipate. "During execution, every single step has to be grounded in realistic execution constraints, which impose higher feasibility standards than the ideation stage." Reviewers discover weaknesses that are only visible through implementation: missing baselines, poor generalizability, impractical evaluation designs, high resource requirements. AI-generated ideas systematically propose evaluations requiring human expert recruitment, which executors consistently changed to save cost and time.

This resolves the tension between "Can LLMs generate more novel ideas than human experts?" and "Why do LLMs excel at feasible design but struggle with novelty?". The ideation-evaluation dissociation IS the problem: LLMs generate novel-sounding ideas precisely because they lack the evaluative capacity to recognize execution barriers. Novelty at ideation is a property of description quality, not executability. Combined with "Why do LLMs generate novel ideas from narrow ranges?", individual LLM ideas may be novel AND individually infeasible; the two findings compound rather than contradict.

The implication for AI-assisted research is that proxy evaluation (judging ideas without execution) systematically overestimates LLM contribution. "Objective metrics like feasibility and effectiveness are best judged via the actual execution outcomes rather than speculative judgment based on the ideas." This challenges any benchmark or evaluation that rates AI research capability without implementation.



Original note title: LLM-generated research ideas suffer an ideation-execution gap — ideas rated as novel at ideation score significantly lower after expert execution on all metrics