Can human researchers improve LLM ideas through iterative feedback?

This explores whether a human-in-the-loop refinement process — researchers giving LLMs corrective feedback round after round — actually makes machine-generated research ideas better, and the corpus suggests the answer depends heavily on *how* that feedback is structured.

This reads the question as asking about the human-in-the-loop refinement loop: an LLM proposes ideas, a researcher critiques, the model revises, and so on. The corpus has a surprising amount to say here, and it cuts in two directions at once. Start with the raw material being refined. In blind comparisons, LLM-generated ideas were rated *more* novel than expert ideas, though slightly less feasible (Do language models generate more novel research ideas than experts?). So there's something real for feedback to work on; the model isn't just echoing the obvious. The catch surfaces the moment anyone tries to build on those ideas: when 43 experts spent 100+ hours executing randomly assigned ideas, the LLM ideas degraded significantly more than the human ones, with execution exposing impractical evaluation designs and missing technical groundwork that were invisible at the brainstorming stage (Do LLM research ideas actually hold up when experts try to execute them?). This is the strongest case *for* iterative human feedback: these are exactly the weaknesses that execution reveals and that a researcher could catch and correct early.

Here's the twist the reader probably doesn't expect: iterative feedback from a single researcher can quietly *degrade* validity rather than improve it. When one person keeps revising prompts to get better outputs, they import their own bias, drift their evaluation criteria to match whatever the model happens to be good at, and create self-fulfilling feedback loops where the model and the human converge on something that merely looks right (Does iterative prompt engineering undermine scientific validity?). So 'human researcher gives feedback' is not automatically improvement; informal, unstructured iteration is part of the problem. The same note prescribes the fix: pre-specified criteria and inter-coder reliability instead of one person's evolving taste.
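To make "inter-coder reliability" concrete, here is a minimal sketch of one common agreement statistic, Cohen's kappa, computed for two researchers who independently judge the same ideas against criteria fixed in advance. The choice of kappa, the `pass`/`fail` labels, and the example ratings are all assumptions for illustration; the cited note does not say which statistic a validated pipeline should use.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # given each coder's own label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of six LLM ideas against one pre-specified criterion.
coder_a = ["pass", "pass", "fail", "pass", "fail", "fail"]
coder_b = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.3f}")
```

A low kappa on criteria written down before looking at outputs is a signal that the criteria are still ambiguous, which is a different failure from one reviewer's taste drifting over rounds.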

That fix theme repeats across the corpus. Structure beats vibes. A three-stage decomposition pipeline reached 86.5% reasoning alignment with human reviewers on novelty assessment, outperforming holistic LLM baselines (Can structured pipelines make LLM novelty assessment reliable?). That is evidence that feedback works best when the *task itself* is broken into checkable steps rather than delivered as a global thumbs-up or thumbs-down. The writing process echoes this: research reports improve through draft-and-revise cycles modeled like diffusion, where a persistent skeleton is iteratively denoised through targeted retrieval, which holds global coherence better than a single linear pass (Can iterative revision cycles match how humans actually write?). Iteration helps, but it is structured iteration, with a stable scaffold being refined, not free-form back-and-forth.
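The three stages named in the cited note (extract claims, retrieve related work, compare) can be sketched as a toy pipeline in which each stage produces an output a human can inspect. Everything here is a stand-in chosen for illustration: sentence splitting for claim extraction, Jaccard word overlap for retrieval, a fixed `threshold` for the comparison, and all function and class names. The actual pipeline presumably uses LLM calls at each stage, and none of this reflects its implementation.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClaimVerdict:
    claim: str
    closest_prior: Optional[str]
    overlap: float
    novel: bool


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_claims(submission: str) -> List[str]:
    # Stage 1 (toy stand-in): treat each sentence as one claim.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", submission) if s.strip()]


def retrieve_related(claim: str, corpus: List[str], k: int = 1) -> List[Tuple[float, str]]:
    # Stage 2 (toy stand-in): rank prior work by Jaccard word overlap.
    claim_tokens = tokens(claim)

    def jaccard(doc: str) -> float:
        doc_tokens = tokens(doc)
        union = claim_tokens | doc_tokens
        return len(claim_tokens & doc_tokens) / len(union) if union else 0.0

    return sorted(((jaccard(doc), doc) for doc in corpus), reverse=True)[:k]


def compare(claim: str, related: List[Tuple[float, str]], threshold: float = 0.5) -> ClaimVerdict:
    # Stage 3 (toy stand-in): a claim counts as novel if nothing retrieved is close.
    best_score, best_doc = related[0] if related else (0.0, None)
    return ClaimVerdict(claim, best_doc, best_score, best_score < threshold)


def assess_novelty(submission: str, corpus: List[str]) -> List[ClaimVerdict]:
    return [compare(c, retrieve_related(c, corpus)) for c in extract_claims(submission)]


# Hypothetical prior work and submission text.
corpus = [
    "Tree search yields dense reward signals for language models.",
    "Diffusion models denoise images step by step.",
]
submission = (
    "Tree search yields dense reward signals for language models. "
    "We train a student to request hints from a teacher."
)
verdicts = assess_novelty(submission, corpus)
for v in verdicts:
    print(v.novel, round(v.overlap, 2), v.claim)
```

The point of the decomposition is that a reviewer can disagree with one stage (a claim extracted badly, a missed related paper) without discarding the whole judgment, which a single holistic score does not allow.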

The lateral surprise is that some research is trying to cut humans out of the feedback loop entirely. MCTS-based self-improvement derives dense, process-level quality signals from tree search outcomes, matching human-labeled feedback without any annotation (Can tree search replace human feedback in LLM training?). From the other side, models can be trained to *actively solicit* corrective feedback through dialogue rather than passively receiving it, reframing tasks as pedagogical conversations where the model learns to extract what a knowledgeable teacher knows (Can LLMs learn to ask for feedback during problem solving?). Read together, the corpus answer is yes: human feedback can improve LLM ideas, and the ideation-execution gap is precisely the gap feedback is positioned to close, but only when the loop is structured, criteria are fixed in advance, and the human resists the temptation to keep moving the goalposts until the output looks good.
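As a rough illustration of how tree search outcomes can yield process-level signals without annotation, the sketch below scores every intermediate step by the average outcome of the finished solutions beneath it. This is a simplified Monte Carlo style backup under assumed names (`Node`, `score_steps`) and a hand-built tree. It is not the AlphaLLM algorithm: it omits selection, expansion, and the critic models the cited note mentions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    step: str
    outcome: Optional[float] = None  # set only on leaves (finished solutions)
    children: List["Node"] = field(default_factory=list)


def _backup(node: Node, path: Tuple[str, ...], scores: Dict[Tuple[str, ...], float]):
    """Return (sum, count) of leaf outcomes below node and record its mean."""
    path = path + (node.step,)
    if not node.children:
        if node.outcome is None:
            raise ValueError(f"leaf {node.step!r} has no outcome")
        total, count = float(node.outcome), 1
    else:
        total, count = 0.0, 0
        for child in node.children:
            child_total, child_count = _backup(child, path, scores)
            total += child_total
            count += child_count
    # The mean leaf outcome is the step-level signal for this partial path.
    scores[path] = total / count
    return total, count


def score_steps(root: Node) -> Dict[Tuple[str, ...], float]:
    scores: Dict[Tuple[str, ...], float] = {}
    _backup(root, (), scores)
    return scores


# Hypothetical search tree: 1.0 means the finished solution was correct.
tree = Node("problem", children=[
    Node("approach A", children=[
        Node("finish A1", outcome=1.0),
        Node("finish A2", outcome=0.0),
    ]),
    Node("approach B", children=[
        Node("finish B1", outcome=0.0),
    ]),
])
scores = score_steps(tree)
for path, value in scores.items():
    print(" > ".join(path), round(value, 2))
```

In this toy tree "approach A" scores higher than "approach B" purely from outcomes, which is the sense in which tree structure can rank solution paths without a human labeler.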

Sources 7 notes

Do language models generate more novel research ideas than experts?

A statistically significant study of 100+ NLP researchers found LLM-generated ideas rated as more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Do LLM research ideas actually hold up when experts try to execute them?

When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.

Does iterative prompt engineering undermine scientific validity?

Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.

Can iterative revision cycles match how humans actually write?

Research writing follows a draft-and-revise pattern analogous to diffusion sampling, where a persistent draft skeleton is iteratively denoised through targeted retrieval steps. This architecture maintains global coherence better than linear pipelines while mirroring cognitive studies of actual human writing.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Can LLMs learn to ask for feedback during problem solving?

Research shows that reformulating static tasks as pedagogical dialogues—where a teacher has privileged information and the student must learn to extract it—trains models to actively engage conversation as a problem-solving tool, not just imitate dialogue patterns.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about human-in-the-loop LLM idea refinement. The question remains: can iterative human feedback actually improve LLM-generated research ideas?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat them as perishable constraints to re-examine.
• LLM-generated ideas are rated statistically MORE novel than expert ideas in blind comparison, but degrade sharply during execution (ideation-execution gap: human ideas executed at 3–5× higher success rates) (2025).
• Single-researcher iterative feedback *degrades* validity: unstructured prompt revision introduces confirmation bias, moving goalposts, and drift in evaluation criteria (2024).
• Structured decomposition (3-stage novelty-assessment pipeline) reached 86% alignment with human reviewers; holistic feedback underperforms (2025).
• Draft-and-revise cycles work better as persistent-skeleton diffusion (targeted retrieval, global coherence) than linear iteration (2025).
• MCTS-based self-improvement and learnable feedback-solicitation bypass human annotation entirely, matching human-labeled signal quality (2024–2026).

Anchor papers (verify; mind their dates):
• arXiv:2409.04109 (2024-09): 100+ NLP researchers, novelty vs. execution gap.
• arXiv:2506.20803 (2025-06): Ideation-Execution Gap study, outcome measurement.
• arXiv:2602.16488 (2026-02): Social Meta-Learning, learnable feedback solicitation.
• arXiv:2508.10795 (2025-08): LLM-Assisted Scholarly Critique, structured feedback.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the ideation-execution gap: have newer model scales, chain-of-thought scaffolding, or domain-specific fine-tuning since closed this gap? For bias-in-iteration: does formal criterion-locking or multi-coder approval now prevent drift? For structured decomposition: do the latest prompt-engineering methods (e.g., tree-of-thought, branching) beat the 86% baseline? Separate the durable question ("does structure help?") from perishable limits ("does 86% hold?").
(2) Surface the strongest contradicting or superseding work from the last ~6 months—especially any claiming unstructured feedback *does* work, or that self-improvement has fully replaced human loops.
(3) Propose 2 research questions assuming the regime may have shifted: e.g., "Does mixture-of-experts feedback (human + agent critics) outperform single-human iteration?" or "Can active learning reorder feedback proposals so humans spend less time correcting obvious gaps?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.

