Tags: Reinforcement Learning for LLMs · LLM Reasoning and Architecture · Knowledge Retrieval and RAG

Why do search agents beat memorized retrieval on hard questions?

Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?

Note · 2026-02-21 · sourced from Deep Research

The DeepResearcher paper trains RL agents in live web-search environments rather than in simulated offline retrieval. The result: these agents outperform models fine-tuned on static knowledge when evaluated on knowledge-intensive tasks. The mechanism is not that real-world RL produces a smarter reasoner; it is that real-world search bypasses the bottleneck that memorized retrieval creates.
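A minimal sketch of that training setup, assuming hypothetical `policy.act`, `web_search`, and `reward_fn` interfaces (DeepResearcher's actual API will differ): within each RL episode the agent interleaves live search calls with reasoning, and the reward is computed only on the final answer.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    reward: float = 0.0

def rollout(policy, question, web_search, reward_fn, max_turns=8):
    """One episode: the policy alternates between issuing live search
    queries and emitting a final answer; reward applies to the answer."""
    traj = Trajectory()
    context = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        action = policy.act(context)  # {"type": "search"|"answer", ...}
        traj.steps.append(action)
        if action["type"] == "search":
            # Live web results rather than a frozen corpus: facts that
            # postdate training or are rare in it are still reachable.
            context.append({"role": "tool", "content": web_search(action["query"])})
        elif action["type"] == "answer":
            traj.reward = reward_fn(question, action["text"])
            break
    return traj
```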

Memorized knowledge has two failure modes that real-time search does not share. First, it is temporally bounded: anything that postdates training is simply absent. Second, it is probabilistically compressed: details that appear infrequently in training data are underrepresented or confabulated. Real-time search has neither constraint. When a query requires a specific fact from a recent paper or a niche domain, the search agent retrieves it rather than reconstructing it from the training distribution.

This reframes what "knowledge-intensive" means for evaluation. A task that looks hard because it requires obscure facts is not testing reasoning ability; it is testing retrieval coverage. A model that scores poorly may reason perfectly well but have a knowledge gap. The DeepResearcher finding suggests a better benchmark design: evaluate reasoning under conditions where retrieval is available, so that reasoning ability is not conflated with parametric coverage.
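One way to operationalize that design, as a rough sketch: run the same benchmark twice, closed-book and with retrieval, and read the gap as retrieval coverage rather than reasoning ability. `model`, `retriever`, and `grade` are hypothetical stand-ins, not any benchmark's actual harness.

```python
def evaluate(model, questions, grade, retriever=None):
    """Score a model on (question, gold) pairs, optionally with retrieval."""
    correct = 0
    for q in questions:
        if retriever is not None:
            docs = retriever(q["question"])
            prompt = f"Context:\n{docs}\n\nQuestion: {q['question']}"
        else:
            prompt = q["question"]
        correct += grade(model(prompt), q["gold"])
    return correct / len(questions)

# closed_book = evaluate(model, qs, grade)
# with_search = evaluate(model, qs, grade, retriever=search)
# A large gap flags the task as retrieval-bound, not reasoning-bound.
```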

The implication for deployment: model capability and retrieval access are substitutes, not complements, for factual tasks. Adding search to a mid-sized model may close the gap with a larger model that lacks search. The investment calculus shifts from training compute toward inference infrastructure.

UR2's difficulty-aware curriculum introduces a refinement: retrieval should be triggered selectively, based on query difficulty, rather than on every query. Easy questions can be answered from parametric knowledge; only hard questions warrant retrieval. This means parametric knowledge and external retrieval are not just substitutes at the system level; they are per-instance alternatives that a trained policy can select between. Per-instance switching further shifts the investment calculus toward smart retrieval routing rather than maximum retrieval coverage.
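A toy version of such a switching policy, to make the idea concrete. The confidence proxy (mean token log-probability of a draft answer) and the threshold are illustrative assumptions: UR2 trains its routing behavior with RL rather than using a fixed heuristic, and `generate_with_logprobs` is a hypothetical model interface.

```python
def answer_with_routing(model, question, retriever, threshold=-0.5):
    """Answer from parametric knowledge when confident; retrieve otherwise."""
    draft, token_logprobs = model.generate_with_logprobs(question)
    confidence = sum(token_logprobs) / max(len(token_logprobs), 1)
    if confidence >= threshold:
        return draft                 # easy query: trust parametric memory
    docs = retriever(question)       # hard query: pay for external retrieval
    return model.generate(f"Context:\n{docs}\n\nQuestion: {question}")
```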

KG-synthesized training data for deep search agents: DeepDive demonstrates that the training-data bottleneck for deep search agents (the scarcity of hard-to-find questions requiring long-horizon reasoning) can be addressed by synthesizing questions from knowledge graphs. KG random walks of varying lengths control reasoning depth, while selective blurring of entity attributes ("entity blurring") prevents shortcut solutions. Combined with multi-turn RL, DeepDive-32B reaches 14.8% on BrowseComp (a benchmark of hard-to-find information), a competitive result among open-source models. The broader principle: KGs are ideal substrates for training-data synthesis because they encode relational complexity while providing verifiable ground truth. See Can knowledge graphs generate training data for search agents?.
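A toy sketch of that synthesis recipe, under assumed data structures (adjacency lists for the KG, attribute dicts for entities). DeepDive's actual pipeline verbalizes questions with an LLM and blurs attributes more selectively than this random subset does; the point here is only the shape of the algorithm.

```python
import random

def blur_entity(attrs, keep=2, rng=random):
    """Selective attribute blurring: describe the start entity by a small
    subset of its attributes instead of naming it, so the question cannot
    be answered with a direct lookup."""
    chosen = rng.sample(sorted(attrs.items()), min(keep, len(attrs)))
    return ", ".join(f"whose {k} is {v}" for k, v in chosen)

def synthesize_question(kg, attrs, walk_length, rng=random):
    """kg: entity -> list of (relation, neighbor) edges (every entity is
    assumed to have outgoing edges); attrs: entity -> attribute dict.
    walk_length controls reasoning depth; the walk endpoint serves as the
    verifiable ground-truth answer."""
    start = rng.choice(list(kg))
    node, relations = start, []
    for _ in range(walk_length):
        relation, node = rng.choice(kg[node])
        relations.append(relation)
    hops = ", then ".join(f"follow '{r}'" for r in relations)
    question = (f"Start from the entity {blur_entity(attrs[start], rng=rng)}; "
                f"{hops}. Which entity do you reach?")
    return question, node
```

The returned endpoint entity is what makes the synthesized data RL-compatible: the reward check reduces to exact-match against a KG node rather than judging free-form answers.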


Source: Deep Research; enriched from Knowledge Graphs

Original note title: deep research agents outperform rl-finetuned models on knowledge-intensive tasks because they replace memorized retrieval with real-world search