The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas

Paper · arXiv 2506.20803 · Published June 25, 2025

Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear novel; it should also result in better research once executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than those of expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p<0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe a flip in rankings on many metrics, with human ideas scoring higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.

A recent large-scale human study examined AI-generated ideas in a randomized, blinded comparison to human experts (Si et al., 2025) and found that LLM ideas were judged significantly more novel than human ideas, with higher average scores across novelty, excitement, and expected effectiveness. However, evaluating research ideas is difficult even for experts (Simsek et al., 2024), leaving open the question of whether these ideas would translate into better research outcomes.

During execution, every single step has to be grounded in realistic execution constraints, which impose higher feasibility standards than the ideation stage. Moreover, objective metrics like feasibility and effectiveness are best judged via the actual execution outcomes rather than speculative judgment based on the ideas alone.

Our work provides the first quantitative, large-scale study of AI-generated ideas after execution. Comparing the review scores of these ideas from the previous ideation evaluation and our new execution evaluation, we observe an ideation-execution gap for LLM-generated ideas: LLM ideas score much lower in the execution evaluation than in the ideation evaluation.
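As a rough illustration (not the paper's released analysis code), the gap can be quantified as a per-idea score delta between the two evaluation stages; the CSV layout, column names, and the choice of Welch's t-test below are assumptions for the sketch:

```python
# Sketch: quantifying the ideation-execution gap from paired review scores.
# Assumes a CSV with one row per review; column names ("idea_id",
# "condition", "stage", "overall") are hypothetical.
import pandas as pd
from scipy import stats

reviews = pd.read_csv("review_scores.csv")

# Average reviewer scores per idea at each stage, then pair the stages.
per_idea = (reviews.groupby(["idea_id", "condition", "stage"])["overall"]
            .mean().unstack("stage"))
per_idea["delta"] = per_idea["execution"] - per_idea["ideation"]

# Compare how much scores drop for AI ideas versus human ideas.
ai = per_idea.xs("AI", level="condition")["delta"]
human = per_idea.xs("Human", level="condition")["delta"]
t, p = stats.ttest_ind(ai, human, equal_var=False)  # Welch's t-test (assumed)
print(f"AI drop: {ai.mean():.2f}, Human drop: {human.mean():.2f}, p={p:.3f}")
```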

Reviewers also consider more comprehensive factors in the execution evaluation, uncovering previously overlooked weaknesses of LLM ideas.

We begin by analyzing the types of changes made to the ideas by our execution participants.

This indicates that only a moderate number of changes are made to both human and AI ideas.

AI-generated ideas frequently propose human evaluations that recruit experts or native speakers to annotate a large set of model predictions, and the executors consistently change these evaluations to save cost and time.

For example, one reviewer flagged an evaluation weakness ("not using the same metrics as other works to compare the efficacy of this method") for the AI idea "Temporal Bias Decay Simulation", which was previously overlooked in the ideation evaluation without observing the executed experiments. Moreover, empirical experiments lead reviewers to notice additional weaknesses in the idea and the experiment design, such as missing baselines and ablations, high resource requirements, and poor generalizability, which are almost entirely overlooked during ideation evaluation. For example, one reviewer commented "lacks comparison with previous work: the method is only compared with the simplest baselines despite well-acknowledged benchmarks" for the AI-generated idea "Contrastive Semantic Pivot Prompting", criticizing missing baselines; one reviewer noted "the experiments should not just be numbers, but also include discussion of why ACGP actually produced the results provided" for the AI-generated idea "Adaptive Confidence-Guided Prompting", requesting more analysis; and one reviewer commented "the method is also very computationally expensive" for another AI-generated idea.

A promising direction for future work is to develop more capable research agents that can autonomously implement research ideas at near-human levels of quality. Such agents could make large-scale idea evaluation and experimentation far more scalable by reducing the dependence on expert human labor.

Proxy Reward Models Executing research ideas is resource-intensive, requiring substantial human effort, time, and computational resources. One avenue to mitigate this cost is the development of proxy reward models: predictive models that can estimate the likely effectiveness of an idea without requiring full implementation. These models could be trained on historical execution outcomes, for example, from papers with known empirical results (Wen et al., 2025), or leverage simulations of the execution environments. If successful, such models could serve various purposes, such as rapidly ranking and filtering generated ideas, and acting as reward functions in reinforcement learning pipelines for idea generation.
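As a minimal sketch of what such a proxy could look like (the training data, the TF-IDF featurization, and the ridge regressor below are all illustrative assumptions, not the paper's method), one could regress historical post-execution review scores on the idea text and use the fitted model to rank new candidates:

```python
# Sketch: a proxy reward model that predicts an idea's execution outcome
# from its text. The data and model choices here are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical historical data: executed ideas and their review scores.
idea_texts = ["Contrastive prompting for ...", "Adaptive decoding via ..."]
execution_scores = [4.5, 6.0]  # e.g., post-execution overall review scores

proxy = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
proxy.fit(idea_texts, execution_scores)

# Rank a batch of newly generated ideas by predicted execution outcome.
candidates = ["Temporal bias decay ...", "Semantic pivot prompting ..."]
predicted = proxy.predict(candidates)
ranked = sorted(zip(candidates, predicted), key=lambda pair: -pair[1])
print(ranked)
```

The same fitted model could, in principle, serve as the reward signal in an RL pipeline for idea generation, though a text-similarity regressor this simple would only be a starting point.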

Execution Feedback Loop Another compelling direction is to build closed feedback loops where the outcomes of executed experiments inform iterative idea improvement. This could be achieved through training-free methods such as evolutionary search or self-refinement mechanisms, where generated ideas are mutated and selected based on past execution performance.
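A minimal sketch of such a training-free loop, assuming a hypothetical `mutate_idea` step (an LLM rewrite of an idea) and an `execute_and_score` oracle (run the experiments and review the results), might look like:

```python
# Sketch: evolutionary idea refinement driven by execution feedback.
# `mutate_idea` and `execute_and_score` are hypothetical stand-ins for
# an LLM rewrite step and an execution-plus-review pipeline.
import random

def evolve_ideas(seed_ideas, mutate_idea, execute_and_score,
                 generations=5, population=8, survivors=3):
    scored = [(execute_and_score(idea), idea) for idea in seed_ideas]
    for _ in range(generations):
        # Keep the ideas whose executed experiments scored best.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        parents = [idea for _, idea in scored[:survivors]]
        # Mutate survivors to refill the population, then re-execute.
        children = [mutate_idea(random.choice(parents))
                    for _ in range(population - survivors)]
        scored = scored[:survivors] + [(execute_and_score(c), c) for c in children]
    return max(scored, key=lambda pair: pair[0])[1]
```

The selection pressure here comes entirely from execution outcomes rather than from speculative ideation-stage judgments, which is precisely the signal the ideation-execution gap suggests is missing.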