INQUIRING LINE

Which LLM backends produce the most executable research ideas?

This reads the question as 'which models reliably turn ideas into something that actually runs?', but the corpus reframes it: the bottleneck isn't the choice of backend. It's that the same models that ideate brilliantly tend to collapse at execution.


This explores which LLM backends produce ideas you can actually build. The corpus's surprising answer is that the choice of backend matters far less than a gap every backend shares: the distance between proposing an idea and making it work. The most cited finding here is an ideation-execution gap. When 43 expert researchers spent 100+ hours implementing randomly assigned ideas, the LLM-generated ones declined significantly more than the human ones across all metrics, and execution exposed impractical evaluation designs and missing technical groundwork that had been invisible at proposal time (see "Do LLM research ideas actually hold up when experts try to execute them?"). So 'most executable' is a moving target: an idea that scores high at proposal time can be the one that falls apart in the lab.

Why does this happen so consistently? Because generation and evaluation appear to be *dissociated* capabilities in current models. LLMs produce more novel ideas than human experts precisely because they lack the disciplinary constraints that experts carry (see "Do language models generate more novel research ideas than experts?"), but that same lack of constraint comes with a systematic avoidance of the evaluative stance needed to judge whether an idea is feasible or valid (see "Can LLMs generate more novel ideas than human experts?"). They also can't reliably grade their own output: automated evaluation overestimates idea quality by around 60% (see "Why do LLMs generate more novel research ideas than experts?"). A backend that's better at novelty is not automatically better at executability, and it can be the reverse: the LLM ideas rated more novel were also rated slightly lower on feasibility.

There's a sharper, more mechanical version of your question hiding in the corpus: can any LLM produce a plan that runs? When researchers tested GPT-4 on planning, only 12% of its generated plans were executable without errors. The models acquire planning *knowledge* fluently but fail at the reasoning assembly needed to handle interactions among subgoals and resources (see "Can large language models actually create executable plans?"). That's the real ceiling on 'executable ideas': not vocabulary or creativity, but the step where pieces have to interlock without contradiction. One explanation for why this is so hard lies in how generation itself works: token prediction continues smoothly toward the training distribution rather than exploring logically related counterpositions, so claims multiply without the internal friction that would catch infeasibility (see "Does LLM generation explore competing claims while producing text?").
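To make 'executable without errors' concrete, here is a minimal Python sketch of a plan checker. It is an illustration only, not the evaluation used in the planning study: the `Step` fields, the `check_plan` and `executable_rate` helpers, and the `gpu_hours` resource are all hypothetical names chosen for this example. A plan counts as executable only if every step's preconditions already hold and no resource budget is overdrawn when the steps run in order.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    needs: set = field(default_factory=set)   # facts that must already hold
    gives: set = field(default_factory=set)   # facts the step makes true
    cost: dict = field(default_factory=dict)  # resource name -> amount consumed


def check_plan(steps, initial_facts, budget):
    """Simulate the steps in order and return (ok, reason)."""
    facts = set(initial_facts)
    remaining = dict(budget)
    for i, step in enumerate(steps):
        # Subgoal interaction: a step cannot run before its inputs exist.
        missing = step.needs - facts
        if missing:
            return False, f"step {i} ({step.name}): missing {sorted(missing)}"
        # Resource interaction: earlier steps may have used up the budget.
        for resource, amount in step.cost.items():
            if remaining.get(resource, 0) < amount:
                return False, f"step {i} ({step.name}): not enough {resource}"
            remaining[resource] = remaining.get(resource, 0) - amount
        facts |= step.gives
    return True, "ok"


def executable_rate(plans, initial_facts, budget):
    """Fraction of plans that pass check_plan."""
    results = [check_plan(p, initial_facts, budget)[0] for p in plans]
    return sum(results) / len(results) if results else 0.0


# Hypothetical toy plans, invented for illustration.
budget = {"gpu_hours": 10}
good = [
    Step("collect_data", gives={"dataset"}),
    Step("train", needs={"dataset"}, gives={"model"}, cost={"gpu_hours": 8}),
    Step("evaluate", needs={"model"}, gives={"results"}, cost={"gpu_hours": 1}),
]
bad_order = [good[1], good[0], good[2]]  # trains before data exists
over_budget = [
    good[0],
    Step("train", needs={"dataset"}, gives={"model"}, cost={"gpu_hours": 20}),
    good[2],
]

rate = executable_rate([good, bad_order, over_budget], set(), budget)
print(check_plan(bad_order, set(), budget))
print(check_plan(over_budget, set(), budget))
print(f"executable rate: {rate:.2f}")
```

Every step in the two failing plans looks reasonable in isolation. One plan fails on ordering and the other on budget, which are the two kinds of interaction the paragraph above identifies as the hard part.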

If you want to actually raise executability, the corpus points to architecture around the model, not the choice of model. A three-stage structured pipeline (extract the claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on novelty judgments for 182 ICLR submissions, outperforming LLM baselines asked to judge holistically (see "Can structured pipelines make LLM novelty assessment reliable?"). The lesson plausibly generalizes: you get more executable ideas by bolting on the evaluative scaffolding the model won't supply on its own, and by treating its proposals as a subjective prior to be weighted and verified rather than as findings to be trusted (see "Should we treat LLM outputs as real empirical data?"). Interestingly, the one forward-looking task where fine-tuned LLMs *do* outperform experts, predicting which neuroscience results actually occurred, draws on the same pattern-integration tendency that elsewhere produces hallucination (see "Can LLMs predict novel scientific results better than experts?"). So the thing that makes a backend feel generative and the thing that makes its ideas executable may be two faces of one mechanism, which is why you can't pick a winner by novelty score alone.
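The extract, retrieve, compare shape of such a pipeline can be sketched in a few lines of Python. This is a toy under stated assumptions, not the pipeline from the cited note: each stage here is a plain string heuristic standing in for what would presumably be a model or search call, and the function names, the Jaccard overlap measure, the `threshold` value, and the example texts are all invented for illustration.

```python
import re


def tokens(text):
    """Lowercased word set used for the toy similarity measure."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_claims(proposal):
    # Stage 1 (toy): treat each sentence as one claim.
    parts = re.split(r"(?<=[.!?])\s+", proposal)
    return [p.strip() for p in parts if p.strip()]


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(claim, corpus, k=1):
    # Stage 2 (toy): rank prior-work snippets by token overlap with the claim.
    scored = [(jaccard(tokens(claim), tokens(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def compare(claim, hits, threshold=0.5):
    # Stage 3 (toy): a claim counts as novel if no retrieved snippet is close.
    best = hits[0][0] if hits else 0.0
    return {"claim": claim, "best_overlap": best, "novel": best < threshold}


def assess(proposal, corpus):
    """Run all three stages and return one verdict per claim."""
    return [compare(c, retrieve(c, corpus)) for c in extract_claims(proposal)]


# Hypothetical prior work and proposal text.
corpus = [
    "Contrastive pretraining improves retrieval quality.",
    "Curriculum learning speeds up convergence.",
]
proposal = (
    "Contrastive pretraining improves retrieval quality. "
    "We schedule tokenizer merges by gradient noise."
)

report = assess(proposal, corpus)
for row in report:
    print(row["novel"], round(row["best_overlap"], 2), row["claim"])
```

The point of the decomposition is that each verdict is tied to a specific claim and a specific retrieved snippet, so a reviewer can audit it, which a single holistic score does not allow.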


Sources (9 notes)

Do LLM research ideas actually hold up when experts try to execute them?

When 43 expert researchers implemented randomly assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.

Do language models generate more novel research ideas than experts?

A study of 100+ NLP researchers found LLM-generated ideas rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Can LLMs generate more novel ideas than human experts?

LLMs produce more novel research ideas than experts because they lack disciplinary constraints, but they systematically avoid evaluative stance-taking required to assess feasibility or validity. Generation and evaluation are dissociated capabilities.

Why do LLMs generate more novel research ideas than experts?

Research shows LLM-generated ideas are statistically more novel than expert-produced ideas, but LLMs struggle to evaluate quality: automated evaluation overestimates idea quality by 60%. When executed, LLM ideas drop significantly on all metrics, suggesting novelty without feasibility.

Can large language models actually create executable plans?

Only 12% of GPT-4 generated plans are actually executable without errors. LLMs excel at acquiring planning knowledge but fail at the reasoning assembly required to handle subgoal and resource interactions.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.

Should we treat LLM outputs as real empirical data?

The Foundation Priors framework shows that LLM-generated text reflects the model's learned patterns and the user's prompt choices, not ground truth. Such outputs should influence inference only through explicitly parameterized trust weights, not be treated as equivalent to real evidence.

Can LLMs predict novel scientific results better than experts?

BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.
