Can LLMs replace search engines during agent training?
Explores whether LLMs possess sufficient internal knowledge to simulate search engines for RL training, potentially eliminating expensive API costs while maintaining training signal quality.
Two papers converge on the same principle from different angles: LLMs possess enough internal world knowledge to serve as their own search engines during RL training, eliminating the prohibitive API costs of real search engine interaction.
ZeroSearch addresses this architecturally. Lightweight SFT transforms a small LLM (3B-14B) into a retrieval module that generates both relevant and noisy documents in response to a query. The key advantage over real search: controllable document quality. By adjusting prompts, the simulator generates either helpful or misleading documents, enabling a curriculum rollout strategy that progressively degrades quality during training. The policy model first learns basic formats, then adapts to increasingly challenging retrieval scenarios.
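The curriculum can be sketched as a noise schedule that raises the chance of prompting the simulator for misleading documents as training progresses. This is an illustrative sketch, not ZeroSearch's exact formula; `noise_probability` and `simulator_prompt` are hypothetical helpers:

```python
import random


def noise_probability(step: int, total_steps: int,
                      p_start: float = 0.0, p_end: float = 0.5,
                      b: float = 4.0) -> float:
    """Exponentially increasing probability that the simulated search
    engine is asked to return noisy/misleading documents.
    Early steps: mostly clean documents, so the policy learns formats;
    late steps: progressively harder retrieval scenarios."""
    frac = step / total_steps
    return p_start + (b ** frac - 1) / (b - 1) * (p_end - p_start)


def simulator_prompt(query: str, step: int, total_steps: int) -> str:
    """Choose between a 'helpful' and a 'misleading' generation prompt
    for the LLM retrieval module, according to the curriculum."""
    if random.random() < noise_probability(step, total_steps):
        return f"Generate five plausible but misleading documents for: {query}"
    return f"Generate five relevant, helpful documents for: {query}"
```

With `b > 1` the schedule stays near `p_start` early on and accelerates toward `p_end`, which matches the paper's idea of degrading quality faster late in training than at the start.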
The result is striking: a 7B retrieval module achieves comparable performance to a real search engine. A 14B module surpasses it. The LLM-simulated environment provides more stable and controllable training than noisy real-world search.
SSRL (Self-Search RL) approaches the same principle from the inference side. LLMs auto-regressively generate search queries, then generate relevant information to address them — the entire reasoning trajectory in a single forward pass. The internal knowledge scales with inference budget: pass@k performance improves substantially with sampling, achieving high accuracy on BrowseComp. RL further enhances this Self-Search capability through format-based and rule-based rewards.
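The reward side can be sketched as follows, assuming hypothetical `<search>`/`<information>`/`<answer>` tag conventions and a simple exact-match rule reward; this is a minimal illustration, not SSRL's actual implementation or weighting:

```python
import re


def format_reward(trajectory: str) -> float:
    """Reward well-formed Self-Search trajectories: every <search> query
    must be paired with a self-generated <information> block, and the
    trajectory must end in exactly one <answer>."""
    searches = re.findall(r"<search>(.*?)</search>", trajectory, re.S)
    infos = re.findall(r"<information>(.*?)</information>", trajectory, re.S)
    answers = re.findall(r"<answer>(.*?)</answer>", trajectory, re.S)
    well_formed = bool(searches) and len(searches) == len(infos) and len(answers) == 1
    return 1.0 if well_formed else 0.0


def rule_reward(trajectory: str, gold: str) -> float:
    """Combine the format reward with a rule-based exact-match answer
    reward (weights here are purely illustrative)."""
    fmt = format_reward(trajectory)
    answers = re.findall(r"<answer>(.*?)</answer>", trajectory, re.S)
    correct = 1.0 if answers and answers[-1].strip().lower() == gold.strip().lower() else 0.0
    return 0.2 * fmt + 0.8 * correct
```

Because the `<information>` block is generated by the policy itself rather than fetched from an API, the whole trajectory — queries, "retrieved" evidence, and answer — is scored without any external search call.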
The tension with *Why do search agents beat memorized retrieval on hard questions?* is real but conditional. Real-world search outperforms simulated search on tasks requiring temporal currency or rare knowledge. But for the majority of training iterations, where the goal is learning search behavior (when to search, how to formulate queries, how to evaluate results), simulated search provides adequate signal at dramatically lower cost.
SSRL adds a surprising finding: thinking tokens are inefficient for search tasks. Long CoT does not improve Self-Search performance — contradicting the pattern seen in math reasoning. Search primarily requires knowledge retrieval, not extended deliberation. Short-CoT should be preferred to maximize token efficiency.
Source: Reasoning o1 o3 Search
Related concepts in this collection
- **Why do search agents beat memorized retrieval on hard questions?** Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts? *Connection: the tension — real search for deployment, simulated search for training.*
- **Can prompt optimization teach models knowledge they lack?** Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge. *Connection: Self-Search is the extreme version — the model activates its own knowledge as search results.*
- **When does explicit reasoning actually help model performance?** Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem? *Connection: search/knowledge-retrieval is another task type where extended reasoning is inefficient.*
Original note title: llms can simulate search engines via internal knowledge eliminating api costs for rl training of search agents