Can human researchers improve LLM ideas through iterative feedback?
This explores whether a human-in-the-loop refinement process — researchers giving LLMs corrective feedback round after round — actually makes machine-generated research ideas better, and the corpus suggests the answer depends heavily on *how* that feedback is structured.
This reads the question as asking about the human-in-the-loop refinement loop: an LLM proposes ideas, a researcher critiques, the model revises, and so on. The corpus has a surprising amount to say here, and it cuts in two directions at once. Start with the raw material being refined. In blind comparisons, LLM-generated ideas are rated *more* novel than expert ideas, though slightly less feasible (Do language models generate more novel research ideas than experts?). So there is something real for feedback to work on; the model isn't just echoing the obvious. The catch surfaces the moment anyone tries to build on those ideas: when 43 experts spent 100+ hours executing randomly assigned ideas, the LLM ones degraded far more than the human ones, revealing impractical evaluation designs and missing technical groundwork that were invisible at the brainstorming stage (Do LLM research ideas actually hold up when experts try to execute them?). This is the strongest case *for* iterative human feedback: the weaknesses that execution exposes are exactly the kind a researcher could catch and correct early.
Here's the twist the reader probably doesn't expect: iterative feedback from a single researcher can quietly *degrade* validity rather than improve it. When one person keeps revising prompts to get better outputs, they import their own bias, drift their evaluation criteria toward whatever the model happens to be good at, and create self-fulfilling feedback loops in which the model and the human converge on something that merely looks right (Does iterative prompt engineering undermine scientific validity?). So 'human researcher gives feedback' is not automatically an improvement; informal, unstructured iteration is part of the problem. The same note prescribes the fix: pre-specified criteria and inter-coder reliability instead of one person's evolving taste.
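To make "inter-coder reliability" concrete, here is a minimal sketch of one common way to measure it: Cohen's kappa between two coders who rate the same ideas against a rubric fixed in advance. The note does not name a specific statistic, so the choice of kappa, the `CRITERIA` labels, and the ratings below are all illustrative assumptions rather than anything taken from the corpus.

```python
from collections import Counter

# Hypothetical rubric, written down before any model output is inspected.
CRITERIA = ("feasible", "infeasible")


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one categorical label per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Illustrative ratings of six LLM-generated ideas by two independent coders.
coder_1 = ["feasible", "feasible", "infeasible", "infeasible", "feasible", "infeasible"]
coder_2 = ["feasible", "infeasible", "infeasible", "infeasible", "feasible", "infeasible"]

# Reject any label that is not part of the pre-specified rubric.
assert all(label in CRITERIA for label in coder_1 + coder_2)

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

The point of the design is that the rubric and both coders' labels exist independently of any single person's prompt revisions, so agreement can be checked rather than assumed. A low kappa would suggest the criteria are too loose to anchor the feedback loop.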
That fix recurs across the corpus. Structure beats vibes. A three-stage decomposition pipeline reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on novelty assessment, outperforming models that were simply asked to judge holistically (Can structured pipelines make LLM novelty assessment reliable?). That is evidence that feedback works best when the *task itself* is broken into checkable steps rather than delivered as a global thumbs-up or thumbs-down. The writing process echoes this: research reports improve through draft-and-revise cycles modeled on diffusion, where a persistent skeleton is iteratively denoised through targeted retrieval, which holds global coherence better than a single linear pass (Can iterative revision cycles match how humans actually write?). Iteration helps, but it is structured iteration that helps, with a stable scaffold being refined, not free-form back-and-forth.
The lateral surprise is that some research is trying to cut humans out of the feedback loop entirely. MCTS-based self-improvement derives dense, process-level quality signals from tree search outcomes, matching human-labeled feedback without any annotation (Can tree search replace human feedback in LLM training?). From the other side, models can be trained to *actively solicit* corrective feedback through dialogue rather than passively receiving it, by reframing tasks as pedagogical conversations in which the model learns to extract what a knowledgeable teacher knows (Can LLMs learn to ask for feedback during problem solving?). Read together, the corpus answer is yes: human feedback can improve LLM ideas, because the ideation-execution gap is precisely the gap feedback is positioned to close. But it does so only when the loop is structured, the criteria are fixed in advance, and the human resists the temptation to keep moving the goalposts until the output looks good.
Sources (7 notes)
A study of 100+ NLP researchers found LLM-generated ideas rated significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.
When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.
Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.
A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.
Research writing follows a draft-and-revise pattern analogous to diffusion sampling, where a persistent draft skeleton is iteratively denoised through targeted retrieval steps. This architecture maintains global coherence better than linear pipelines while mirroring cognitive studies of actual human writing.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
Research shows that reformulating static tasks as pedagogical dialogues—where a teacher has privileged information and the student must learn to extract it—trains models to actively engage conversation as a problem-solving tool, not just imitate dialogue patterns.