R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models (LLMs). While they achieve remarkable performance on challenging tasks such as mathematics and coding, they often rely on their internal knowledge to solve problems, which can be inadequate for time-sensitive or knowledge-intensive questions, leading to inaccuracies and hallucinations. To address this, we propose R1-Searcher, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs. This method allows LLMs to autonomously invoke external search systems to access additional knowledge during the reasoning process. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini. The code is available at https://github.com/RUCAIBox/R1-Searcher.
To address this issue, extensive research has focused on augmenting LLMs with external information sources (a.k.a., retrieval-augmented generation (RAG) [12, 13]). Early approaches emphasize specific prompting strategies to guide LLMs in iterative question decomposition, query generation, and sub-question answering [14, 15, 16]. While effective, these complex prompt designs often rely on closed-source LLMs to achieve optimal performance. Subsequent studies investigate distilling this capability into smaller LLMs through supervised fine-tuning (SFT) [17]. However, recent findings suggest that SFT-based distillation can cause models to memorize solution paths, limiting their generalization to novel scenarios [18]. More recent proposals adopt test-time scaling methods [11, 19], notably employing the Monte Carlo Tree Search (MCTS) framework to enhance solution-finding by expanding the search space during inference. Despite its promise, this approach incurs significant inference overhead, reducing its practicality for widespread use. Therefore, we propose integrating an external retrieval environment during training, enabling models to explore and learn to effectively utilize retrieval for problem-solving. This approach aims to incentivize the search capability in LLMs, thereby enhancing their generalization and improving inference efficiency.
In this paper, we introduce R1-Searcher, a novel framework to enhance the RAG capabilities of LLMs with RL. Our core motivation is to incentivize the search capability in LLMs via exploration within an external retrieval environment. To implement this, we design a two-stage, outcome-based RL approach with a tailored reward design, enabling the model to freely explore how to invoke an external retrieval system to acquire relevant knowledge during the reasoning process. Specifically, in the first stage, we employ a retrieve reward to incentivize the model to conduct retrieval operations without considering final answer accuracy, so that the LLM quickly learns the correct format for invoking retrieval. In the second stage, we further introduce an answer reward to encourage the model to learn to effectively utilize the external retrieval system to answer questions correctly.
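To make the two-stage reward design concrete, the sketch below shows one plausible implementation of the stage-wise reward functions. The tag names (`<|begin_of_query|>`, `<think>`, `<answer>`), the reward magnitudes, and the token-level F1 answer scorer are illustrative assumptions, not the paper's exact specification.

```python
import re
from collections import Counter

def f1_score(prediction: str, golden: str) -> float:
    """Token-level F1 between predicted and golden answers (assumed scorer)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = golden.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def stage1_reward(rollout: str) -> float:
    """Stage 1: reward retrieval invocation and well-formed output,
    ignoring answer correctness entirely."""
    reward = 0.0
    # Retrieve reward: the model issued at least one search query.
    if re.search(r"<\|begin_of_query\|>.*?<\|end_of_query\|>", rollout, re.DOTALL):
        reward += 0.5
    # Format reward: reasoning and answer tags are present and well formed.
    if re.search(r"<think>.*?</think>", rollout, re.DOTALL) and \
       re.search(r"<answer>.*?</answer>", rollout, re.DOTALL):
        reward += 0.5
    return reward

def stage2_reward(rollout: str, golden: str) -> float:
    """Stage 2: outcome-based answer reward, with a penalty for broken format."""
    match = re.search(r"<answer>(.*?)</answer>", rollout, re.DOTALL)
    if match is None:
        return -1.0  # no parsable answer: format penalty
    return f1_score(match.group(1).strip(), golden)
```

Keeping both rewards purely outcome-based avoids the need for a process reward model: the policy is free to decide when and how often to search, as long as the final trace is well formed and the answer is correct.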
System Prompt for Data Selection
You are a helpful assistant. Given a question, you should answer it by first thinking about the reasoning process in the mind and then providing the final answer. The output format of reasoning process and final answer are enclosed within <think> </think> and <answer> </answer> tags, respectively.
System Prompt for Base Model
The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the final answer. The output format of reasoning process and final answer are enclosed within <think> </think> and <answer> </answer> tags, respectively.
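In practice, this system prompt is paired with a generation loop that pauses decoding whenever the model emits a search query, calls the external retriever, and splices the retrieved documents back into the context before resuming. The sketch below illustrates that interaction; the tag names and the `model.generate` / `retriever.search` interfaces are hypothetical stand-ins for whatever inference and retrieval stack is actually used.

```python
def rollout_with_retrieval(model, retriever, prompt: str, max_turns: int = 8) -> str:
    """Generate a reasoning trace, invoking the external retriever each time
    the model emits a query (tag names are assumptions for illustration)."""
    context = prompt
    for _ in range(max_turns):
        # Decode until the model either closes a query or finishes its answer.
        # (Assume `generate` stops *before* emitting the stop string itself.)
        chunk = model.generate(context, stop=["<|end_of_query|>", "</answer>"])
        context += chunk
        if "<|begin_of_query|>" not in chunk:
            context += "</answer>"
            break  # no new query: the model was finishing its final answer
        # Extract the latest query and fetch supporting documents.
        query = chunk.split("<|begin_of_query|>")[-1].strip()
        docs = retriever.search(query, top_k=3)  # hypothetical retriever API
        context += ("<|end_of_query|>\n\n<|begin_of_documents|>\n"
                    + "\n".join(docs) + "\n<|end_of_documents|>\n")
    return context
```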
Judge Prompt
Given a Question and its Golden Answer, verify whether the Predicted Answer is correct. The prediction is correct if it fully aligns with the meaning and key information of the Golden Answer. Respond with True if the prediction is correct and False otherwise.
Question:
Golden Answer:
Predicted Answer:
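The judge prompt above is instantiated per evaluation example and sent to an evaluator LLM, whose True/False verdict serves as the correctness signal. A minimal sketch of this usage, assuming a hypothetical `judge_model.chat` interface:

```python
JUDGE_PROMPT = (
    "Given a Question and its Golden Answer, verify whether the Predicted "
    "Answer is correct. The prediction is correct if it fully aligns with "
    "the meaning and key information of the Golden Answer. Respond with "
    "True if the prediction is correct and False otherwise.\n"
    "Question: {question}\n"
    "Golden Answer: {golden}\n"
    "Predicted Answer: {predicted}\n"
)

def judge_prediction(judge_model, question: str, golden: str, predicted: str) -> bool:
    """Ask the judge LLM for a True/False correctness verdict."""
    verdict = judge_model.chat(JUDGE_PROMPT.format(
        question=question, golden=golden, predicted=predicted))
    return verdict.strip().lower().startswith("true")
```

A semantic LLM judge is used here rather than exact string matching, since golden answers may be phrased differently from a correct prediction.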