SSRL: Self-Search Reinforcement Learning

Paper · arXiv 2508.10874 · Published August 14, 2025
Tags: Reasoning · o1 · o3 · Search · Reinforcement Learning

We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs’ Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
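The abstract refers to format-based and rule-based rewards without specifying them, so the sketch below shows one plausible combination: a format term that checks the rollout is well-formed, plus a rule term that exact-matches the final answer. The tag names and the 0.1/0.9 weighting are illustrative assumptions, not the paper's exact specification.

```python
import re

def ssrl_reward(trajectory: str, gold_answer: str) -> float:
    """Hypothetical combined reward: format check + rule-based answer match.

    Tag names and the 0.1/0.9 weighting are assumptions for illustration.
    """
    # Format reward: every tag pair must be balanced, and the rollout must
    # contain exactly one <answer>...</answer> span.
    required = ("think", "search", "information", "answer")
    well_formed = all(
        len(re.findall(f"<{t}>", trajectory)) == len(re.findall(f"</{t}>", trajectory))
        for t in required
    )
    answers = re.findall(r"<answer>(.*?)</answer>", trajectory, re.DOTALL)
    format_ok = well_formed and len(answers) == 1

    # Rule-based reward: exact match on the normalized final answer.
    def norm(s: str) -> str:
        return re.sub(r"\s+", " ", s.strip().lower())

    correct = bool(answers) and norm(answers[-1]) == norm(gold_answer)
    return 0.1 * float(format_ok) + 0.9 * float(correct)
```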

Our iterative reasoning framework follows a structured process in which the model first expresses its initial thoughts within <think>...</think> tags. When the model identifies missing information necessary for solving the problem, it formulates search queries within <search>...</search> tags. The model then auto-regressively generates relevant information to address these queries, which is incorporated within <information>...</information> tags. This cycle of thinking, searching, and information gathering continues iteratively until the model arrives at a final answer. While this approach shares similarities with traditional multi-turn search systems, it fundamentally differs in its implementation: rather than conducting genuine iterative interactions with external systems, our method employs a Chain-of-Thought (Wei et al., 2023) process in which the language model auto-regressively generates the entire reasoning trajectory in a single forward pass, including thoughts, search queries, and retrieved information. This design enables efficient self-contained search while maintaining the structured exploration benefits of iterative search processes.
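To make the single-pass rollout concrete, here is a minimal Python sketch of Self-Search generation and parsing. The `generate` completion hook and the prompt wording are placeholders of my own; the tag scheme follows the description above, and crucially no external search engine is ever called.

```python
import re
from typing import Callable

# Placeholder instruction; the paper's actual prompt may differ.
SYSTEM_PROMPT = (
    "Think inside <think>...</think>. When information is missing, write a "
    "query inside <search>...</search>, then produce the retrieved content "
    "yourself inside <information>...</information>. Repeat as needed and "
    "finish with <answer>...</answer>."
)

def self_search_rollout(generate: Callable[[str], str], question: str) -> dict:
    """Single-pass Self-Search: the model autoregressively emits thoughts,
    queries, and self-generated 'retrieved' information in one sequence.
    `generate` is an assumed LLM completion hook (prompt -> continuation).
    """
    trajectory = generate(f"{SYSTEM_PROMPT}\n\nQuestion: {question}\n")
    # Parse the structured spans out of the single generated sequence.
    spans = {
        tag: re.findall(rf"<{tag}>(.*?)</{tag}>", trajectory, re.DOTALL)
        for tag in ("think", "search", "information", "answer")
    }
    return {"trajectory": trajectory, **spans}
```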

Inefficient Utilization of Thinking Tokens. Qwen3 models support both “thinking” and “no thinking” modes (Yang et al., 2025b), allowing manual adjustment of the number of thinking tokens before the model produces a final answer. To investigate the influence of increasing the thinking-token budget in Self-Search settings, we conduct a comparative study evaluating the impact of enabling versus disabling the thinking process during inference. For the thinking-token comparison, we count only the tokens that fall outside the <search>...</search>, <information>...</information>, and <answer>...</answer> tags. As presented in Figure 4, the results demonstrate that as the number of assigned tokens increases, long-CoT reasoning does not yield better performance, contrary to what is observed on complex math questions. This is likely because agentic search depends mainly on the use of knowledge, whether internal or external, rather than on extended reasoning alone. These findings indicate that short CoT should be preferred in Self-Search settings to maximize token efficiency.
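As a concrete reading of this counting rule, the sketch below strips the <search>, <information>, and <answer> spans from a rollout and counts what remains as thinking tokens. The `tokenize` hook is an assumed stand-in for the model's tokenizer (text -> list of token ids); the paper's exact accounting may differ.

```python
import re
from typing import Callable, List

def count_thinking_tokens(trajectory: str, tokenize: Callable[[str], List[int]]) -> int:
    """Count tokens outside <search>, <information>, and <answer> spans,
    a rough proxy for the 'thinking tokens' compared in Figure 4.
    """
    # Remove the non-thinking spans, keeping <think> content and free text.
    stripped = re.sub(
        r"<(search|information|answer)>.*?</\1>", "", trajectory, flags=re.DOTALL
    )
    return len(tokenize(stripped))
```

For example, with a whitespace tokenizer (`lambda s: s.split()`), the rollout `"<think>need date</think><search>q</search><information>1901</information><answer>1901</answer>"` counts only the two tokens inside <think>.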