Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Paper · arXiv 2508.07976 · Published August 11, 2025
Tags: Deep Research · Reinforcement Learning · Reasoning · o1 · o3 · Search · Tool · Computer Use

Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. ≤ 10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset.

Recent advances in LLM-based agents have demonstrated remarkable capabilities in solving complex, knowledge-intensive problems by leveraging single or multiple external tools [42, 45, 37]. Among these, search tools stand out as particularly critical, enabling agents to access vast external knowledge for enhanced problem-solving [26, 8, 27]. However, expert-level use of search requires advanced intelligence. For instance, consider the question “As of December 31, 2024, what were the numbers of gold, silver, and bronze medals won by China in the 2012 London Olympics?” While seemingly straightforward, this query is in fact challenging due to conflicting answers online (e.g., “38 gold, 27 silver, 22 bronze” vs. “39 gold, 31 silver, 22 bronze”). A search agent must navigate noisy and conflicting answers from diverse sources, trace the root cause of the conflict to doping-related disqualifications reported in official records, and ultimately determine the correct answer. Challenging real-world tasks thus require an agent to resolve high uncertainty in input queries, generate precise search queries, analyze and extract key insights from massive data, resolve inconsistencies, and conduct in-depth exploration. We term this advanced capability “Search Intelligence”.

Proprietary agents and models have already exhibited signs of complex search behaviors through large-scale Reinforcement Learning (RL) training [1, 25]. However, open-source approaches for developing search agents still face significant limitations. One line of work employs Reinforcement Learning or Supervised Fine-Tuning to incentivize tool-using capabilities [11, 30, 49, 33]. On the other hand, prompt-based LLM agents backed by open-source models can perform massive numbers of tool calls without any training [18, 2]. In practice, however, we find that existing online RL approaches fail to incentivize complex and effective search strategies, and that prompt-based LLM agents can fail due to insufficient capabilities of the underlying LLM, such as failing to precisely extract key information from noisy webpages or being unable to verify wrong conclusions. More recently, some works build on prompt-based LLM agents and apply offline RL to improve them [32, 19]. However, this offline RL paradigm has been shown to underperform online RL across a broad range of domains [43, 6, 31].

In reasoning tasks such as math and coding, online RL has enabled models to evolve complex behaviors by iteratively refining their reasoning processes based on correctness feedback [9, 22, 7]. This raises a critical question: how can online RL methods effectively unlock Search Intelligence in open-source agents?
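To make the correctness-feedback loop concrete, here is a minimal sketch of an outcome-reward RL step of the kind described above. The exact-match reward and the `rollout`/`update` interfaces are illustrative assumptions, not the paper's actual training pipeline.

```python
from typing import Callable, Iterable, List, Tuple

def outcome_reward(predicted: str, gold: str) -> float:
    """Binary correctness feedback: 1.0 for an exact match, 0.0 otherwise (assumed metric)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def rl_step(
    rollout: Callable[[str], Tuple[str, str]],          # question -> (reasoning trace, answer); assumed
    update: Callable[[List[Tuple[str, float]]], None],  # any policy-gradient update (e.g., PPO/GRPO); assumed
    batch: Iterable[Tuple[str, str]],                   # (question, gold answer) pairs
) -> None:
    """One outcome-reward RL step: sample, score by correctness, update the policy."""
    scored: List[Tuple[str, float]] = []
    for question, gold in batch:
        trace, answer = rollout(question)
        scored.append((trace, outcome_reward(answer, gold)))
    update(scored)
```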

We identify two critical obstacles hindering effective online RL training for search agents:

• Insufficient search turns limit complex strategy learning. Existing works, such as Search-R1 [11], artificially cap the number of search turns, e.g., at ≤ 10 per trajectory, preventing the agent from exploring deeper search paths. Yet complex queries often require multi-turn tool calls and multi-step reasoning that cannot be learned under such strict turn limits (see the sketch after this list).

• Lack of large-scale, high-quality question-answer (QA) pairs. RL training for reasoning tasks requires abundant, challenging, and correct QA pairs [3, 16, 46]. However, existing open-source datasets for search agents are often outdated (e.g., HotpotQA), oversimplified, or too small, failing to stimulate complex search behaviors through RL [44, 17, 34].
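To illustrate where a small turn limit bites, here is a minimal sketch of a multi-turn search rollout with a hard turn cap. The `ANSWER:`/`SEARCH:` action format and the `llm`/`search` interfaces are assumptions for illustration, not the implementation of any of the cited systems.

```python
from typing import Callable, List, Optional

def run_agent(
    llm: Callable[[str], str],       # prompt -> next action string (assumed interface)
    search: Callable[[str], str],    # query -> retrieved results (assumed interface)
    question: str,
    max_turns: int = 10,             # the small cap criticized above
) -> Optional[str]:
    """Multi-turn search rollout with a hard turn budget."""
    context: List[str] = [f"Question: {question}"]
    for _ in range(max_turns):
        action = llm("\n".join(context))
        context.append(action)
        if action.startswith("ANSWER:"):       # agent commits to an answer
            return action[len("ANSWER:"):].strip()
        if action.startswith("SEARCH:"):       # agent issues a search query
            query = action[len("SEARCH:"):].strip()
            context.append(f"Results: {search(query)}")
    return None  # budget exhausted: deeper strategies are cut off here
```

Once the loop returns `None`, the trajectory yields no reward signal for the partial strategy, which is exactly why strict caps prevent complex behaviors from being learned.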

To address these challenges, we introduce ASearcher, an open-source project to enable large-scale agentic RL training for search agents. Our contributions include:

• Long-horizon search via fully asynchronous agentic RL training. With a large turn limit, batch-generation RL training systems [11, 30, 21, 35] suffer significant idle time, since the whole batch must wait for its longest trajectories, slowing down training. Building on AReaL [7], our fully asynchronous system prevents long trajectories from blocking training by decoupling trajectory execution from model updates (see the sketch after this list). This allows relaxed turn limits (e.g., 128 turns per trajectory), enabling agents to explore deeper search paths without sacrificing training efficiency. Remarkably, our agent, ASearcher-Web-QwQ, achieves extreme long-horizon search during RL training, with tool calls exceeding 40 turns and generated tokens surpassing 150k.

• A scalable QA synthesis agent. We design an LLM-based agent that autonomously generates challenging, uncertain, and grounded QA pairs requiring multi-turn tool use. Starting from seed questions, the agent iteratively fuzzes queries by obscuring key information or injects external facts to increase complexity. Each constructed question undergoes multi-stage validation to ensure quality and difficulty (a sketch follows this list). From 14k seed QAs, we generate 134k high-quality samples, of which 25.6k require external tools to resolve.
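The following is a minimal sketch of the decoupling idea behind asynchronous training: rollout workers run trajectories at their own pace and push finished ones into a queue, while the trainer consumes whatever is ready. The queue-based design mirrors the description above, but the concrete functions and batching are assumptions, not AReaL's actual API.

```python
import queue
import threading
from typing import Callable, List

def rollout_worker(run_trajectory: Callable[[], list], out: "queue.Queue", n: int) -> None:
    """Each worker runs full trajectories independently; a slow, 100+-turn
    trajectory only delays this worker, never the global training step."""
    for _ in range(n):
        out.put(run_trajectory())

def trainer(train_step: Callable[[List[list]], None], inbox: "queue.Queue",
            batch_size: int, total_steps: int) -> None:
    """Consume trajectories as they arrive; never wait for the slowest rollout."""
    for _ in range(total_steps):
        batch = [inbox.get() for _ in range(batch_size)]
        train_step(batch)

# Usage sketch (hypothetical `my_rollout` and `my_train_step`):
# q = queue.Queue()
# for _ in range(8):
#     threading.Thread(target=rollout_worker, args=(my_rollout, q, 100)).start()
# trainer(my_train_step, q, batch_size=32, total_steps=200)
```

In a synchronous batch-generation system, by contrast, every training step blocks on the single longest trajectory in the batch; decoupling via the queue removes that bottleneck.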
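And here is a minimal sketch of the iterative QA synthesis loop described above: starting from a seed QA pair, the agent either obscures ("fuzzes") key information or injects an external fact, then keeps only candidates that pass validation. The operator names and validation interface are illustrative assumptions, not the paper's prompts.

```python
import random
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, answer)

def synthesize(
    seed: QA,
    fuzz: Callable[[str], str],      # obscure a key entity/date (assumed LLM call)
    inject: Callable[[str], str],    # weave in an external fact (assumed LLM call)
    validate: Callable[[QA], bool],  # multi-stage quality/difficulty check (assumed)
    iterations: int = 4,
) -> List[QA]:
    """Iteratively harden a seed question, keeping only validated variants."""
    accepted: List[QA] = []
    question, answer = seed
    for _ in range(iterations):
        op = random.choice([fuzz, inject])
        candidate = (op(question), answer)  # answer stays fixed and grounded
        if validate(candidate):             # drop ambiguous or trivial variants
            accepted.append(candidate)
            question = candidate[0]         # build further on the harder variant
    return accepted
```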

Using ASearcher, we train agents equipped with search engines and browsers under two settings: (1) RL training starting from base models (Qwen2.5-7B/14B), to demonstrate that our training pipeline incentivizes strong and generalizable search strategies, and (2) fine-tuning a prompt-based agent built on a powerful LRM (QwQ-32B), to validate the scalability of our training pipeline for fine-tuning large-scale prompt-based LLM agents.
