Can retrieval strategies drive both draft refinement and new research question generation?

This explores whether one mechanism — using retrieval as a feedback loop — can serve two jobs at once: tightening an existing draft, and surfacing the new questions a researcher should ask next.

This reads the question as asking whether retrieval is just a fetch-and-fill step or something closer to an engine that both refines what you've written and tells you what to investigate next. The corpus suggests it can be both — and the bridge between the two jobs is the same insight: a partial draft is itself a signal about what's missing.

Start with refinement. One framing treats research writing as diffusion-style denoising: you hold a persistent draft skeleton and repeatedly improve it through targeted retrieval rather than writing top-to-bottom in one pass (see "Can iterative revision cycles match how humans actually write?"). Each retrieval step is aimed at a rough patch in the current draft, which keeps the whole thing globally coherent instead of locally patched. That's retrieval *driving* refinement: the draft's weak spots decide what gets pulled.
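A minimal sketch of that idea, under loud assumptions: the draft skeleton is just a list of headings, "weakness" is approximated by how little evidence a section has, and retrieval is a toy word-overlap match. The names `refine` and `retrieve` are hypothetical and are not taken from any of the cited notes.

```python
from typing import Dict, List, Optional, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


def retrieve(query: str, corpus: List[str], used: Set[str]) -> Optional[str]:
    """Toy retriever: the unused document sharing the most words with the query."""
    q = _tokens(query)
    best, best_score = None, 0
    for doc in corpus:
        if doc in used:
            continue
        score = len(q & _tokens(doc))
        if score > best_score:  # strict '>' keeps the earliest doc on ties
            best, best_score = doc, score
    return best


def refine(skeleton: List[str], corpus: List[str], steps: int = 4) -> Dict[str, List[str]]:
    """Keep one persistent draft and repeatedly strengthen its weakest section."""
    draft: Dict[str, List[str]] = {heading: [] for heading in skeleton}
    used: Set[str] = set()
    for _ in range(steps):
        # Visit sections from least to most supported (assumed proxy for "rough patch").
        for heading in sorted(draft, key=lambda h: len(draft[h])):
            doc = retrieve(heading, corpus, used)
            if doc is not None:
                draft[heading].append(doc)
                used.add(doc)
                break
        else:
            break  # no section could be improved any further
    return draft


skeleton = ["Refinement through targeted retrieval", "Failure modes of fabrication"]
corpus = [
    "Targeted retrieval repairs weak draft sections.",
    "Fabrication is a common failure of research agents.",
    "Retrieval refinement keeps drafts coherent.",
]
draft = refine(skeleton, corpus)
```

The design point is that the loop never rewrites the draft from scratch: the skeleton persists, and each step is chosen by looking at the draft's current state rather than at the original prompt.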

The more surprising half is question generation, and the key note is that a model's own partial answer reveals information needs the original query couldn't express (see "Can a model's partial response guide what to retrieve next?"). When you feed a generated response back in as the next retrieval query, you surface implicit gaps, which is functionally the same as generating a new, sharper sub-question. So the loop that refines a draft and the loop that proposes new lines of inquiry are mechanically the same loop, just read in two directions: the gap can be filled (refinement) or pursued (a new question). This is why systems that separate query *planning* from answer *synthesis* outperform flat ones on multi-hop work (see "Do hierarchical retrieval architectures outperform flat ones on complex queries?"): the planning component is precisely where "what should I ask next" lives as a first-class step.
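The feedback loop described above can be sketched roughly as follows. This is an illustration in the spirit of the ITER-RETGEN note, not its actual implementation: the word-overlap retriever, the `generate` callable, and the name `iter_retgen` are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Set, Tuple


def _tokens(text: str) -> Set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


def retrieve(query: str, corpus: List[str], exclude: Sequence[str] = (), k: int = 1) -> List[str]:
    """Toy retriever: top-k unseen documents ranked by word overlap with the query."""
    q = _tokens(query)
    candidates = [d for d in corpus if d not in exclude]
    ranked = sorted(candidates, key=lambda d: len(q & _tokens(d)), reverse=True)
    return [d for d in ranked[:k] if q & _tokens(d)]


def iter_retgen(
    question: str,
    corpus: List[str],
    generate: Callable[[str, List[str]], str],
    rounds: int = 2,
) -> Tuple[str, List[str]]:
    """Alternate retrieval and generation, using each draft as the next query."""
    query, draft, evidence = question, "", []
    for _ in range(rounds):
        evidence.extend(retrieve(query, corpus, exclude=evidence))
        draft = generate(question, evidence)
        # The partial answer names things the question never mentioned,
        # so appending it widens what the next retrieval can match.
        query = question + " " + draft
    return draft, evidence


corpus = [
    "Lovelace wrote notes on the Analytical Engine.",
    "The Analytical Engine was designed by Babbage.",
    "Bananas are rich in potassium.",
]
question = "Who designed the machine Lovelace wrote about?"


def stub_generate(q: str, ev: List[str]) -> str:
    """Stand-in generator (assumption): simply restates the evidence."""
    return " ".join(ev)


_, one_round = iter_retgen(question, corpus, stub_generate, rounds=1)
_, two_rounds = iter_retgen(question, corpus, stub_generate, rounds=2)
```

In this toy run the first round retrieves only the Lovelace document; the draft built from it mentions the Analytical Engine, and that is what lets the second round reach the Babbage document. Read the other way, the terms the draft added to the query are a candidate sub-question.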

But the corpus also names the failure mode you'd worry about. If retrieval can generate new questions, it can also generate confident garbage: deep research agents fabricate examples and evidence to satisfy a demand for depth, which accounts for 39% of failures in an analysis of 1,000 failure reports (see "Why do deep research agents fabricate scholarly content?"). The proposed guardrail is making generation earn its place: letting a system grow its own corpus from its outputs only when those outputs pass entailment, attribution, and novelty checks (see "Can RAG systems safely learn from their own generated answers?"), or refusing to answer at all when evidence is too thin (see "Can RAG systems refuse to answer without reliable evidence?"). Without that gate, a question-generating loop just compounds its own hallucinations.
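As a rough illustration of such a gate, the sketch below admits a generated answer into the corpus only if it passes three checks. The checks here are crude lexical stand-ins chosen for the example (word coverage for entailment, Jaccard similarity for novelty); a real system would presumably use trained verifiers. The names `Candidate`, `admit`, and the thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Candidate:
    text: str          # generated answer proposed for the corpus
    cited: List[str]   # source documents the answer claims to rest on


def _tokens(text: str) -> Set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


def is_attributed(c: Candidate, corpus: List[str]) -> bool:
    """Attribution: at least one citation, and every citation exists in the corpus."""
    return bool(c.cited) and all(src in corpus for src in c.cited)


def is_entailed(c: Candidate, min_support: float = 0.8) -> bool:
    """Entailment proxy (assumption): most answer words appear in the cited sources."""
    support: Set[str] = set().union(*[_tokens(src) for src in c.cited])
    answer = _tokens(c.text)
    return bool(answer) and len(answer & support) / len(answer) >= min_support


def is_novel(c: Candidate, corpus: List[str], max_sim: float = 0.9) -> bool:
    """Novelty proxy (assumption): not a near-duplicate of any existing document."""
    answer = _tokens(c.text)
    for doc in corpus:
        doc_tokens = _tokens(doc)
        union = answer | doc_tokens
        if union and len(answer & doc_tokens) / len(union) > max_sim:
            return False
    return True


def admit(c: Candidate, corpus: List[str]) -> bool:
    """Append the answer to the corpus only if all three checks pass."""
    if is_attributed(c, corpus) and is_entailed(c) and is_novel(c, corpus):
        corpus.append(c.text)
        return True
    return False


corpus = [
    "Lovelace wrote notes on the Analytical Engine.",
    "The Analytical Engine was designed by Babbage.",
]
grounded = Candidate("Babbage designed the Analytical Engine Lovelace wrote notes on.", list(corpus))
fabricated = Candidate("Babbage sold the Engine to NASA in 1850.", [corpus[1]])
duplicate = Candidate(corpus[0], [corpus[0]])
results = [admit(c, corpus) for c in (grounded, fabricated, duplicate)]
```

Refusal is the same gate applied at answer time: when no candidate passes, the system returns nothing rather than its best guess.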

Two deeper caveats reframe the whole thing. First, not every question wants the same retrieval: question *type* determines strategy, so a comparison or debate question needs aspect-specific retrieval while an evidence-based question suits standard RAG (see "Does question type determine the right retrieval strategy?"). A loop that generates new questions had better also classify them. Second, retrieval works best when it's trained on whether documents actually *helped* the answer, not just whether they looked similar (see "Can retrieval learn what actually helps answer questions?"), which is exactly the signal a draft-refinement loop produces for free. The thing you didn't know you wanted to know: drafting and question-generation aren't two features to build separately. They're the forward and reverse readings of a single retrieval-feedback loop, and the corpus's open problem is governing it so it generates real questions instead of plausible fictions (see "Where do retrieval systems fail and why?").
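The first caveat, classifying generated questions before retrieving for them, might look something like the sketch below. The type-to-strategy table follows the source note; the keyword cues in `CUES` and the `classify` function are invented purely for illustration and would be replaced by a real classifier in practice.

```python
from enum import Enum


class QType(Enum):
    EVIDENCE = "evidence-based"
    COMPARISON = "comparison"
    DEBATE = "debate"
    EXPERIENCE = "experience"
    REASON = "reason"


# Strategy per type, as summarized in the source note.
STRATEGY = {
    QType.EVIDENCE: "standard RAG",
    QType.COMPARISON: "aspect-specific retrieval",
    QType.DEBATE: "aspect-specific retrieval",
    QType.EXPERIENCE: "decomposition or filtering",
    QType.REASON: "decomposition or filtering",
}

# Hypothetical keyword cues (assumption): checked in order, first match wins.
CUES = [
    (QType.COMPARISON, ("compare", " versus ", " vs ", "difference between")),
    (QType.DEBATE, ("should ", "pros and cons")),
    (QType.EXPERIENCE, ("has anyone", "your experience", "recommend")),
    (QType.REASON, ("why ", "how come")),
]


def classify(question: str) -> QType:
    """Toy classifier: match cue phrases, fall back to evidence-based."""
    q = " " + question.lower().strip() + " "
    for qtype, cues in CUES:
        if any(cue in q for cue in cues):
            return qtype
    return QType.EVIDENCE


def route(question: str) -> str:
    """Pick a retrieval strategy for a newly generated question."""
    return STRATEGY[classify(question)]


routes = [
    route("Why do deep research agents fabricate scholarly content?"),
    route("What is the difference between flat and hierarchical retrieval?"),
    route("Which year was the archive digitized?"),
]
```

Placing `route` between question generation and retrieval keeps the loop from sending every new question through the same pipeline.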

Sources 9 notes

Can iterative revision cycles match how humans actually write?

Research writing follows a draft-and-revise pattern analogous to diffusion sampling, where a persistent draft skeleton is iteratively denoised through targeted retrieval steps. This architecture maintains global coherence better than linear pipelines while mirroring cognitive studies of actual human writing.

Can a model's partial response guide what to retrieve next?

ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.

Why do deep research agents fabricate scholarly content?

Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.

Can retrieval learn what actually helps answer questions?

CLaRa propagates generator loss back through continuous document representations, allowing retrievers to optimize for documents that actually improve answers rather than surface similarity. The gap between relevance and usefulness closes when retrieval receives direct feedback from generation success.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.
