When should retrieval happen during model generation?
Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
The default RAG paradigm retrieves once before generation and never again. This works for short-form factoid questions where the information need is fully expressed in the query. It fails for long-form generation where information needs emerge as the text develops — you cannot know in advance what you will need to support page three of a document.
FLARE (Forward-Looking Active REtrieval augmented generation) introduces a principled trigger: retrieve only when the model generates low-probability tokens. The assumption is that large language models are reasonably well calibrated, so low confidence signals genuine knowledge gaps rather than stylistic uncertainty. When the model starts guessing, it should look something up; when it is confident, retrieval would only add noise.
The mechanism: generate a tentative next sentence, check its token probabilities, retrieve if confidence falls below a threshold, and regenerate with the retrieved context. The retrieval query is the tentative sentence itself, making it forward-looking rather than backward-looking: this "what I am about to say" framing captures future information needs better than "what I was asked."
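To make the loop concrete, here is a minimal sketch of the trigger logic. It is a sketch under assumptions, not FLARE's actual implementation: `lm.next_sentence()` and `retriever.search()` are hypothetical stand-ins, and the threshold value is illustrative. The real FLARE goes further, masking low-confidence tokens out of the query or rephrasing them as explicit questions before retrieving.

```python
CONFIDENCE_THRESHOLD = 0.6  # illustrative; FLARE tunes this per task
MAX_SENTENCES = 50          # safety cap on generation length

def flare_generate(prompt, lm, retriever):
    """Sentence-level generation with uncertainty-gated retrieval.

    `lm` and `retriever` are assumed interfaces: lm.next_sentence(context)
    returns (sentence, per-token probabilities), and retriever.search(query)
    returns a list of passage strings.
    """
    output = []
    for _ in range(MAX_SENTENCES):
        context = prompt + " " + " ".join(output)
        sentence, token_probs = lm.next_sentence(context)
        if not sentence:
            break  # model signalled completion

        # Trigger: retrieve only if some token falls below the threshold,
        # i.e. the model is guessing somewhere in this sentence.
        if token_probs and min(token_probs) < CONFIDENCE_THRESHOLD:
            # Forward-looking query: the tentative sentence itself states
            # what the model is about to say, hence what it needs to know.
            passages = retriever.search(sentence)
            evidence = "\n".join(passages)
            sentence, _ = lm.next_sentence(
                context + "\n\nRelevant context:\n" + evidence
            )  # regenerate the sentence grounded in retrieved evidence

        output.append(sentence)
    return " ".join(output)
```

Note the asymmetry in cost: confident sentences pass through with a single generation call, while uncertain ones pay for one retrieval plus one regeneration, which is exactly the adaptive budget allocation described below.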
The distinction between short-form and long-form generation matters architecturally. Short-form (factoid QA) has clear information needs explicit in the query — single retrieval is appropriate. Long-form (summaries, essays, reports) has evolving information needs that only become clear during generation — iterative retrieval is necessary. Treating both the same way is the failure mode of standard RAG.
The practical consequence: retrieval becomes a dynamic resource, not a fixed setup cost. Active retrieval systems naturally allocate more retrieval budget to uncertain passages and none to passages the model handles confidently. This aligns retrieval investment with actual knowledge gaps.
Step-level retrieval for reasoning chains (Search-o1): the active retrieval principle extends from long-form generation to step-wise reasoning. Search-o1 integrates an agentic search workflow into o1-like reasoning chains: when the model encounters knowledge uncertainty at any reasoning step, it generates a search query and retrieves external knowledge. Standard problem-level RAG does not address this; it retrieves once at the start, while knowledge needs in complex reasoning vary step by step. The frequency of uncertainty markers (e.g., "perhaps" averaging 30+ occurrences per reasoning chain) signals that knowledge gaps are pervasive in extended reasoning, not isolated. A separate Reason-in-Documents module filters retrieved content before injection, addressing the noise problem: raw retrieved documents are verbose and can disrupt reasoning coherence.
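The control flow can be sketched the same way. This is schematic, and the names are assumptions: in the actual system the model emits special search-query tokens mid-generation rather than returning a query from a function call, and Reason-in-Documents is itself an LLM pass that distills raw passages before injection.

```python
MAX_STEPS = 30  # illustrative cap on reasoning steps

def search_o1_reason(question, lm, retriever):
    """Step-wise reasoning with uncertainty-gated search (schematic).

    Assumed stand-ins: lm.reason_step(chain) returns the next reasoning
    step plus a search query when the model detects a knowledge gap
    (None when it is confident); lm.reason_in_documents(query, docs, chain)
    distills verbose retrieved documents into the facts the step needs.
    """
    chain = question
    for _ in range(MAX_STEPS):
        step, search_query = lm.reason_step(chain)
        if search_query is not None:
            raw_docs = retriever.search(search_query)
            # Reason-in-Documents: filter retrieved content before
            # injection so verbose documents do not disrupt the
            # coherence of the ongoing reasoning chain.
            distilled = lm.reason_in_documents(search_query, raw_docs, chain)
            step, _ = lm.reason_step(chain + "\n" + distilled)
        chain = chain + "\n" + step
        if step.startswith("Final answer:"):  # illustrative stop condition
            break
    return chain
```

The key contrast with the FLARE sketch is granularity: the gate fires per reasoning step rather than per sentence, and retrieved text is distilled before it enters the chain instead of being appended raw.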
Source: RAG; enriched from Reasoning o1 o3 Search
Related concepts in this collection
- Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones? Relation: same adaptive allocation principle, here applied to retrieval rather than reasoning tokens.
- Does search budget scale like reasoning tokens for answer quality? Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching. Relation: search budget as an inference-compute axis; active retrieval is the trigger mechanism that determines where that budget goes.
- Does reasoning fine-tuning make models worse at declining to answer? When models are trained to reason better, do they lose the ability to say "I don't know"? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty. Relation: a related failure; models that cannot say "I don't know" also cannot identify when to retrieve.
- Why do reasoning models overthink ill-posed questions? Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or an inherent limitation. Relation: active retrieval is the constructive response to a detected knowledge gap; overthinking is the pathological alternative when the model lacks a retrieval escape and spirals on its own reasoning instead.
- When should an agent actually stop and deliberate? How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors. Relation: same uncertainty-triggered adaptive-compute principle at a different granularity; FLARE triggers retrieval on low-confidence tokens, SAND triggers deliberation on inconsistent action samples.
- Do iterative refinement methods suffer from overthinking? Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences? Relation: iterative refinement fails partly because it re-reasons over the same information; uncertainty-triggered retrieval provides the missing ingredient by injecting new evidence when revision stalls.
- Can reasoning stay grounded without external feedback loops? Explores whether language models can maintain accurate reasoning through their own internal chains of thought, or whether they need real-world feedback to avoid hallucination and error propagation. Relation: FLARE refines ReAct's foundational interleaving principle; ReAct retrieves at every reasoning step unconditionally, while uncertainty-gated retrieval makes the trigger conditional on genuine knowledge gaps rather than mandatory at each step.
- Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there is a point beyond which additional reasoning becomes counterproductive. Relation: both identify the cost of continuing computation past its useful threshold. FLARE gates retrieval on detected knowledge gaps rather than at fixed intervals; the overthinking note shows that thinking tokens beyond the sweet spot harm accuracy. Uncertainty gating is the retrieval-level analog of the optimal thinking-token limit.
- Can models learn when to think versus respond quickly? Can a single LLM learn to adaptively choose between extended reasoning and concise responses based on task complexity? This matters because it could optimize compute efficiency without sacrificing accuracy on hard problems. Relation: Thinkless and FLARE solve the same when-to-invest-compute problem at different levels; Thinkless decides at the response level (think vs. answer briefly), FLARE at the retrieval level (retrieve vs. not). Both use model uncertainty as the trigger signal.
- Can uncertainty estimation replace complex adaptive retrieval? Is a simpler approach using model confidence signals sufficient to decide when retrieval is needed, or do complex multi-call adaptive pipelines deliver meaningful benefits? Relation: validates and extends the FLARE principle; calibrated token-probability uncertainty estimation is sufficient for retrieval-trigger decisions and outperforms more complex adaptive pipelines.
Original note title: active retrieval should trigger on model uncertainty, not at fixed intervals