Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can LLMs replace search engines during agent training?

Explores whether LLMs possess sufficient internal knowledge to simulate search engines for RL training, potentially eliminating expensive API costs while maintaining training signal quality.

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search

Two papers converge on the same principle from different angles: LLMs possess enough internal world knowledge to serve as their own search engines during RL training, eliminating the prohibitive API costs of real search engine interaction.

ZeroSearch addresses this architecturally. Lightweight SFT transforms a small LLM (3B-14B) into a retrieval module that generates both relevant and noisy documents in response to a query. The key advantage over real search: controllable document quality. By adjusting prompts, the simulator generates either helpful or misleading documents, enabling a curriculum rollout strategy that progressively degrades quality during training. The policy model first learns basic formats, then adapts to increasingly challenging retrieval scenarios.
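A minimal sketch of the curriculum rollout idea, with assumptions: `sim_llm` stands in for the SFT-tuned retrieval module, and the prompt wording, `generate` interface, and linear `noise_ratio` schedule are illustrative, not ZeroSearch's released code.

```python
import random

# Hypothetical prompts: one elicits helpful documents, one misleading ones.
USEFUL_PROMPT = "Generate five documents that help answer the query: {query}"
NOISY_PROMPT = "Generate five plausible but misleading documents for the query: {query}"

def noise_ratio(step: int, total_steps: int, p_end: float = 0.75) -> float:
    """Linearly anneal the share of noisy documents as training progresses."""
    return p_end * step / total_steps

def simulated_search(sim_llm, query: str, step: int, total_steps: int) -> str:
    """Route each policy query to the helpful or misleading prompt per the curriculum."""
    use_noisy = random.random() < noise_ratio(step, total_steps)
    prompt = NOISY_PROMPT if use_noisy else USEFUL_PROMPT
    return sim_llm.generate(prompt.format(query=query))  # assumed interface
```

Early in training nearly every rollout sees helpful documents, so the policy can learn the interaction format; by the end most rollouts face misleading retrieval, forcing robustness.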

The results are striking: a 7B retrieval module achieves performance comparable to a real search engine, and a 14B module surpasses it. The LLM-simulated environment also provides more stable and controllable training than noisy real-world search.

SSRL (Self-Search RL) approaches the same principle from the inference side. The LLM autoregressively generates search queries, then generates the supporting information itself, producing the entire reasoning trajectory in a single forward pass. How much internal knowledge it can surface scales with inference budget: pass@k improves substantially with more samples, reaching high accuracy on BrowseComp. RL further strengthens this Self-Search capability through format-based and rule-based rewards.
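A hedged sketch of what one self-search rollout and a pass@k check could look like; the `<search>/<information>/<answer>` tag scheme follows the common search-agent convention, and `llm.generate` plus exact-match grading are assumptions for illustration.

```python
import re

def self_search_pass_at_k(llm, question: str, gold: str, k: int = 16) -> bool:
    """Sample k trajectories; each is ONE forward pass in which the model
    writes both its search queries and the 'retrieved' information itself."""
    prompt = (
        "Answer the question. Wrap search queries in <search>...</search>, "
        "recalled evidence in <information>...</information>, and the final "
        f"answer in <answer>...</answer>.\nQuestion: {question}"
    )
    for _ in range(k):
        trajectory = llm.generate(prompt, temperature=1.0)
        match = re.search(r"<answer>(.*?)</answer>", trajectory, re.DOTALL)
        if match and match.group(1).strip().lower() == gold.strip().lower():
            return True  # pass@k credits any single correct sample
    return False
```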

The tension with "Why do search agents beat memorized retrieval on hard questions?" is real but conditional. Real-world search outperforms simulated search on tasks that require temporal currency or rare knowledge. But for the majority of training iterations, where the goal is learning search behavior (when to search, how to formulate queries, how to evaluate results), simulated search provides adequate signal at dramatically lower cost.

SSRL adds a surprising finding: thinking tokens are inefficient for search tasks. Long chain-of-thought does not improve Self-Search performance, contradicting the pattern seen in math reasoning. Search primarily requires knowledge retrieval, not extended deliberation, so short CoT is preferable for token efficiency.
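To make the reward structure concrete, here is an illustrative composite reward, reusing the tag scheme assumed above; the weights and the token-budget penalty (which merely operationalizes the short-CoT preference) are our assumptions, not SSRL's published coefficients.

```python
import re

def self_search_reward(trajectory: str, gold: str, budget: int = 512) -> float:
    """Format-based reward plus rule-based exact-match reward, with an
    illustrative penalty on tokens beyond a short-CoT budget."""
    fmt_ok = bool(re.search(
        r"<search>.*?</search>.*?<information>.*?</information>.*?<answer>.*?</answer>",
        trajectory, re.DOTALL))
    ans = re.search(r"<answer>(.*?)</answer>", trajectory, re.DOTALL)
    correct = ans is not None and ans.group(1).strip().lower() == gold.strip().lower()
    reward = 0.1 * fmt_ok + 1.0 * correct
    overflow = max(0, len(trajectory.split()) - budget)
    return reward - 0.1 * overflow / budget  # discourages long deliberation
```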


