Knowledge Retrieval and RAG

When should retrieval happen during model generation?

Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.

Note · 2026-02-22 · sourced from RAG

The default RAG paradigm retrieves once before generation and never again. This works for short-form factoid questions where the information need is fully expressed in the query. It fails for long-form generation where information needs emerge as the text develops — you cannot know in advance what you will need to support page three of a document.

FLARE (Forward-Looking Active Retrieval) introduces a principled trigger: retrieve only when the model generates low-probability tokens. The assumption is that large language models are reasonably well-calibrated — low confidence signals genuine knowledge gaps rather than stylistic uncertainty. When the model starts guessing, it should look something up. When it is confident, retrieval would add noise.

The mechanism: generate a tentative next sentence, check token probabilities, retrieve if confidence falls below threshold, regenerate with retrieved context. The retrieval query is the tentative sentence itself — forward-looking rather than backward-looking. This "what I am about to say" framing captures future information needs better than "what I was asked."
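This generate–check–retrieve–regenerate loop can be sketched as follows. This is a minimal illustration, not FLARE's actual implementation: the helper names (`generate`, `confidence`, `retrieve`), the threshold value, and the toy stubs in `demo` are all assumptions for exposition.

```python
def active_generate(query, generate, confidence, retrieve,
                    threshold=0.6, max_steps=10):
    """Sketch of FLARE-style active retrieval: generate a tentative
    sentence; if any token probability falls below `threshold`, retrieve
    using the tentative sentence itself as the query, then regenerate
    with the retrieved context."""
    context = []   # retrieved passages accumulated so far
    output = []    # accepted sentences
    for _ in range(max_steps):
        tentative = generate(query, output, context)
        if tentative is None:              # model signals completion
            break
        if min(confidence(tentative)) < threshold:
            # Forward-looking: the sentence we were about to say is the query.
            context.extend(retrieve(tentative))
            tentative = generate(query, output, context)
        output.append(tentative)
    return " ".join(output)


# Toy stubs to show the control flow (not a real language model).
def demo():
    retrieved_queries = []
    sentences = ["Paris is the capital of France.",
                 "Its population is about 2.1 million."]

    def generate(q, out, ctx):
        return sentences[len(out)] if len(out) < len(sentences) else None

    def confidence(sentence):
        # Pretend the model is unsure about the population figure.
        return [0.3] if "population" in sentence else [0.9]

    def retrieve(q):
        retrieved_queries.append(q)
        return ["[doc] Paris population ~2.1M."]

    text = active_generate("Tell me about Paris", generate, confidence, retrieve)
    return text, retrieved_queries
```

The key design point visible even in the sketch: retrieval fires only on the low-confidence sentence, and the query is the tentative continuation rather than the original question.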

The distinction between short-form and long-form generation matters architecturally. Short-form (factoid QA) has clear information needs explicit in the query — single retrieval is appropriate. Long-form (summaries, essays, reports) has evolving information needs that only become clear during generation — iterative retrieval is necessary. Treating both the same way is the failure mode of standard RAG.

The practical consequence: retrieval becomes a dynamic resource, not a fixed setup cost. Active retrieval systems naturally allocate more retrieval budget to uncertain passages and none to passages the model handles confidently. This aligns retrieval investment with actual knowledge gaps.

Step-level retrieval for reasoning chains (Search-o1): The active retrieval principle extends from long-form generation to step-wise reasoning. Search-o1 integrates an agentic search workflow into o1-like reasoning chains: when the model encounters knowledge uncertainty at any reasoning step, it generates a search query to retrieve external knowledge. Standard problem-level RAG does not address this — it retrieves once at the start, while knowledge needs vary step by step in complex reasoning. The frequency of uncertainty markers (e.g., "perhaps" averaging 30+ occurrences per reasoning chain) signals that knowledge gaps are pervasive in extended reasoning, not isolated. A separate Reason-in-Documents module filters retrieved content before injecting it into the chain, addressing the noise problem: raw retrieved documents are verbose and can disrupt reasoning coherence.
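The step-level pattern can be sketched as a loop in which each reasoning step either advances the chain or emits a search query, with retrieved documents passed through a filter before injection. This is a schematic analogue of Search-o1, not its implementation; the `SEARCH:` marker, the helper names (`next_step`, `search`, `filter_docs`), and the demo stubs are assumptions.

```python
def reason_with_search(question, next_step, search, filter_docs, max_steps=8):
    """Step-wise reasoning with uncertainty-triggered retrieval: when a
    step emits a search query instead of a conclusion, retrieve and pass
    the documents through a filter (Reason-in-Documents analogue) before
    appending them to the chain."""
    chain = []
    for _ in range(max_steps):
        step = next_step(question, chain)
        if step is None:                    # reasoning complete
            break
        if step.startswith("SEARCH:"):
            query = step[len("SEARCH:"):].strip()
            docs = search(query)
            # Condense noisy retrieval before it enters the chain.
            chain.append(filter_docs(query, docs))
        else:
            chain.append(step)
    return chain


# Toy stubs illustrating the flow (scripted, not a real reasoner).
def demo():
    def next_step(q, chain):
        plan = [
            "SEARCH: boiling point of ethanol",
            "Ethanol boils at 78.37 C, below water's 100 C, so ethanol boils first.",
            None,
        ]
        return plan[len(chain)]

    def search(query):
        return ["long noisy document ... ethanol bp 78.37 C ... unrelated text"]

    def filter_docs(query, docs):
        return f"[knowledge] {query}: 78.37 C"

    return reason_with_search("Which boils first, ethanol or water?",
                              next_step, search, filter_docs)
```

Separating `search` from `filter_docs` mirrors the note's point: retrieval and injection are distinct steps, and the filter is what keeps verbose documents from derailing the chain.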


Source: RAG; enriched from Reasoning o1 o3 Search

Original note title: active retrieval should trigger on model uncertainty, not at fixed intervals