How can per-step decisions about knowledge retrieval improve reasoning over uniform policies?
This explores whether letting a model decide retrieval moment-by-moment during reasoning — pulling external knowledge only at the steps that need it — beats applying one fixed retrieval rule everywhere.
The corpus's clearest answer comes from framing retrieval as a sequence of choices rather than a switch you flip once. DeepRAG models retrieval-augmented reasoning as a Markov Decision Process, learning at every step whether to consult external sources or trust what the model already knows; the payoff is a ~22% accuracy gain that comes as much from *not* retrieving when retrieval would only inject noise as from retrieving when it helps (When should language models retrieve external knowledge versus use internal knowledge?). The lesson is counterintuitive: uniform 'always retrieve' policies don't just waste effort, they actively degrade reasoning by drowning good internal knowledge in irrelevant fetched text.
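The per-step choice described above can be sketched as a small loop. This is an illustration only, not DeepRAG's method: a fixed confidence threshold stands in for the learned policy, and `Step`, `answer_steps`, `retrieve`, and the `confidence` field are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    subquery: str
    confidence: float  # hypothetical: model's self-estimated confidence in [0, 1]


def answer_steps(
    steps: List[Step],
    retrieve: Callable[[str], str],
    threshold: float = 0.7,  # assumed stand-in for a learned per-step policy
) -> List[Tuple[str, str]]:
    """Return (subquery, source) pairs, retrieving only at low-confidence steps."""
    trace = []
    for step in steps:
        if step.confidence < threshold:
            # The step needs outside knowledge: pay for one retrieval call.
            evidence = retrieve(step.subquery)
            trace.append((step.subquery, f"retrieved: {evidence}"))
        else:
            # The step is answerable from parametric knowledge: skip retrieval
            # so no irrelevant text enters the context.
            trace.append((step.subquery, "parametric"))
    return trace
```

A uniform "always retrieve" policy corresponds to `threshold` above 1.0, and "never retrieve" to a `threshold` of 0.0; the selective regime is everything in between.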
The same 'choose the structure per query' instinct shows up one level higher. StructRAG routes each query to a task-appropriate knowledge format (a table, a graph, an algorithm, a catalogue, or a plain chunk) depending on what the question actually demands. It beats uniform retrieval by grounding the choice in cognitive-fit theory: different reasoning tasks fit different representations, so forcing one shape on all of them is a mismatch (Can routing queries to task-matched structures improve RAG reasoning?). Per-step and per-query selectivity are the same idea applied at different granularities: match the retrieval action to the local demand instead of standardizing it.
There's a deeper reason selectivity wins, visible if you look at *which* steps matter. Work on RLVR finds that only about 20% of tokens are high-entropy 'forking points' where the model genuinely decides where reasoning goes, and training on just those matches or exceeds full-gradient performance (Do high-entropy tokens drive reasoning model improvements?). Retrieval decisions plausibly cluster at exactly these junctions: most steps are low-stakes continuations where external lookup adds nothing, while a few pivotal steps are where fresh knowledge changes the trajectory. A uniform policy spends equally on both; a per-step policy concentrates effort where the fork actually is. Graph-O1 makes this concrete in the retrieval setting itself: instead of ingesting a whole knowledge graph, it learns a step-by-step traversal policy with MCTS and RL, deciding which edge to follow next rather than reading everything (Can learned traversal policies beat exhaustive graph reading?).
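To make "high-entropy forking points" concrete, here is a minimal sketch that computes Shannon entropy per position and flags the top fraction. It assumes per-position next-token probability distributions are available; the 20% default echoes the figure above, while `entropy` and `forking_mask` are hypothetical helper names, and this is not the cited work's training procedure.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_mask(dists: List[Sequence[float]], fraction: float = 0.2) -> List[bool]:
    """Mark the highest-entropy positions; at least one position is always marked."""
    ents = [entropy(d) for d in dists]
    k = max(1, round(fraction * len(ents)))
    # Indices of the k positions with the largest entropy.
    top = set(sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)[:k])
    return [i in top for i in range(len(ents))]
```

A per-step retrieval policy could, under the same assumption, consult such a mask and consider a lookup only at flagged positions; whether retrieval needs really coincide with entropy peaks is the conjecture above, not an established result.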
Selectivity also has to be budgeted, not just toggled. Agentic deep research shows search behaves like a test-time scaling axis with diminishing returns, so the question isn't only *whether* to retrieve at a step but *how much* budget to spend across steps (Does search budget scale like reasoning tokens for answer quality?). And long-horizon research suffers when any single step over-spends: capping per-turn reasoning preserves the context window for later retrieval rounds, which is a per-step discipline rather than a global time limit (Does limiting reasoning per turn improve multi-turn search quality?). There's a subtle trap worth naming, though: chain-of-thought reasoning degrades predictably off-distribution, producing fluent but invalid logic (Does chain-of-thought reasoning actually generalize beyond training data?), so a per-step policy is only as trustworthy as the step-level judgments driving it. That's why generative step-wise judges that reason *about* each reasoning step outperform classifier-style scorers (Can judges that reason about reasoning outperform classifier rewards?): good per-step retrieval needs good per-step evaluation to know which steps were actually pivotal.
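The per-turn budgeting point can be illustrated with simple token arithmetic. The sketch below is a toy model under stated assumptions: a fixed context window, a fixed retrieval cost per turn, and reasoning that is simply truncated at the cap. The function name and every number in the usage note are hypothetical, not measurements from the cited work.

```python
from typing import List, Optional


def turns_completed(
    context_window: int,
    reasoning_demand: List[int],
    retrieved_per_turn: int,
    per_turn_cap: Optional[int] = None,
) -> int:
    """Count how many search turns fit in the context window.

    Each turn spends reasoning tokens (capped if per_turn_cap is set)
    plus a fixed number of retrieved-evidence tokens.
    """
    used = 0
    turns = 0
    for demand in reasoning_demand:
        reasoning = demand if per_turn_cap is None else min(demand, per_turn_cap)
        cost = reasoning + retrieved_per_turn
        if used + cost > context_window:
            break  # no room left to take in new evidence
        used += cost
        turns += 1
    return turns
```

With a 10,000-token window, 1,000 retrieved tokens per turn, and reasoning demands of 6,000, 3,000, 2,000, and 1,000 tokens, the uncapped agent finishes one turn before the window is exhausted, while a 1,500-token per-turn cap lets all four turns complete. The toy model ignores any quality lost by truncating reasoning, which is the trade-off a real budget would need to tune.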
The thread across all of this: the corpus keeps finding that one fixed policy applied uniformly is the wrong default, and that the real gains live in learning *where the decision points are* and acting differently at each one — whether the decision is retrieve-vs-recall, which structure to fetch, which graph edge to walk, or how much budget to burn before moving on.
Sources (8 notes)
DeepRAG models retrieval-augmented reasoning as a Markov Decision Process in which the model learns, at each step, when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
StructRAG demonstrates that selecting knowledge structure type based on query demands—via DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks—improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Graph-O1 replaces whole-graph ingestion with step-by-step agentic navigation using Monte Carlo Tree Search and reinforcement learning. This approach fits within LLM context windows while learning domain-specific traversal policies, though it trades certainty about the full graph for decision-making under uncertainty.
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.