DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
First, on the data side, most existing QA datasets feature relatively simple questions that do not reflect true "hard-to-find" cases. For example, questions in HotpotQA [Yang et al., 2018] can often be solved by searching for a few clearly named entities. In contrast, deep search questions such as those in BrowseComp typically involve multiple blurry entities, requiring long-horizon reasoning and deep search to reach the correct answer. Second, on the training side, how to effectively combine long-horizon reasoning with deep search tool use remains an open question. Even strong reasoning models such as DeepSeek-R1 [DeepSeek-AI et al., 2025] make only shallow tool calls and often suffer from hallucinations (see Figure 1 Left). In addition, existing search/browsing agents that integrate browsing tools are primarily designed for direct search tasks. For example, systems like R1-Searcher [Song et al., 2025], ReSearch [Chen et al., 2025], and DeepResearcher [Zheng et al., 2025] are mainly trained and evaluated on datasets similar to HotpotQA, including 2WikiMultiHopQA [Ho et al., 2020], Bamboogle [Press et al., 2022b], and Musique [Trivedi et al., 2022].
To address these challenges, we present DeepDive to advance deep search agents. First, we introduce a strategy to automatically synthesize hard-to-find questions from open knowledge graphs (KGs). Second, we apply end-to-end multi-turn RL to enhance LLMs’ long-horizon reasoning with deep search.
We address the lack of difficulty in existing QA datasets by automatically constructing a deep search QA dataset from KGs, which naturally encode multi-hop connections and attach multiple attributes to each entity. By deliberately blurring some attributes of each entity during question construction, we create "blurry entities". We then perform random walks on the KG to extract long, multi-hop paths and use LLMs to further obfuscate key cues, making the QA pairs more challenging.
Language Model Reasoning. Recent research has shown that large language models (LLMs) can improve performance on math and code tasks by generating explicit reasoning traces during inference [Guo et al., 2025, Team, 2025b]. Reinforcement learning (RL) has emerged as a powerful training paradigm for reasoning, with methods such as self-correction using reward signals to refine internal reasoning [Kumar et al., 2024]. Proprietary models such as OpenAI's o1 series demonstrate that scaling inference-time reasoning alone can produce notable improvements on math, code, and scientific benchmarks. Building on this, DeepSeek-AI introduced DeepSeek-R1, an open-source LLM trained with large-scale RL using the Group Relative Policy Optimization (GRPO) algorithm to directly optimize reasoning accuracy [Guo et al., 2025, Shao et al., 2024]. These models reason entirely through their own token generation without environmental feedback, producing verifiable step-by-step solutions with self-checking mechanisms. This line of work confirms that well-designed RL training can equip LLMs with sophisticated autonomous problem-solving abilities.
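At a high level, GRPO replaces a learned value network with a group-relative baseline: for each question, several responses are sampled and each response's reward is normalized against the group's own statistics. A minimal sketch of that advantage computation (an illustrative simplification, not the actual training code; the function name is ours):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's
    reward against the mean and population std of its own group,
    so no separate value network (critic) is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All responses in the group scored identically: no signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Responses with above-average reward in their group get positive advantages (their tokens are reinforced), and below-average ones get negative advantages, without ever training a critic.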
Building deep search agents requires training data that goes beyond conventional multi-hop QA. While datasets like HotpotQA involve predictable reasoning steps, true deep search agents should act like human researchers who iteratively search, filter, and synthesize scattered evidence from the web. This calls for complex, hard-to-find questions that even domain experts need hours of searching to solve. Such training data is critical for developing agents that handle real-world tasks where information is scattered, conflicting, and hard to locate.
However, the training data needed to cultivate this skill is naturally scarce on the internet. Because manual annotation is prohibitively expensive and difficult to scale, synthetic data generation emerges as the most efficient and scalable solution.
Knowledge Graphs with Hard-to-Find Information. Naturally, knowledge graphs (KGs) provide a structured and semantically rich environment for multi-hop reasoning, making them particularly well-suited for generating supervision data for training deep search agents. First, verifiability: KGs encode factual entity-relation triples that are inherently traceable and objective, ensuring answer correctness and significantly improving data reliability compared to fully model-generated QA pairs. Second, multi-hop structure: KGs allow us to explicitly control reasoning depth by performing random walks of varying lengths, enabling the generation of questions requiring multiple inference steps. Third, reasoning controllability: each entity node contains multiple attributes that can be selectively obscured (such as dates, names, or locations), thereby increasing ambiguity and preventing models from exploiting shortcut solutions. Data built from KGs thus forces models to iteratively reason, search, validate, and reflect before arriving at accurate answers. In light of these advantages, we propose an automated KG-based method to generate scalable, high-quality, and reasoning-intensive QA pairs.
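The generation procedure can be sketched in two steps: a random walk over the KG to extract a multi-hop path, and attribute-level obfuscation of entities along the path. The snippet below is a toy illustration under assumed data structures (a dict-of-edge-lists KG and per-entity attribute dicts); the actual pipeline additionally uses LLMs to obfuscate key cues:

```python
import random

def random_walk(kg, start, hops, rng=random):
    """Sample a multi-hop path [entity, relation, entity, ...] from a
    KG stored as {entity: [(relation, neighbor), ...]}. Walk length
    controls the reasoning depth of the resulting question."""
    path = [start]
    node = start
    for _ in range(hops):
        edges = kg.get(node, [])
        if not edges:
            break  # dead end: stop early with a shorter path
        rel, node = rng.choice(edges)
        path += [rel, node]
    return path

def blur(attrs):
    """Replace a named entity with a vague description built from its
    non-name attributes, hiding the name itself (the 'blurry entity'
    trick, in toy form)."""
    kept = {k: v for k, v in attrs.items() if k != "name"}
    return "an entity with " + ", ".join(f"{k}: {v}" for k, v in kept.items())
```

For example, walking two hops from a start entity yields a path like `[e0, r1, e1, r2, e2]`; the terminal entity becomes the answer, and intermediate entities are blurred into attribute-based descriptions so the question cannot be solved by a single lookup.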
We present DeepDive, which aligns deep reasoning with multi-turn web search through automated deep search QA synthesis and end-to-end multi-turn reinforcement learning. The data pipeline creates ambiguity-rich, multi-hop questions with hidden cues that mirror real long-horizon tasks. After the RL stage, DeepDive-32B attains 14.8% accuracy on BrowseComp, setting a new competitive result among open-source models and outperforming larger agents and several strong proprietary baselines. Analyses show that complex supervision and multi-turn RL jointly ground tool use, that performance scales with tool-call budgets and parallel sampling, and that skills learned on hard problems transfer to simpler settings. We release our datasets, models, and code to support progress toward open deep search systems.
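The parallel-sampling gains mentioned above rest on a simple mechanism: sample n independent trajectories for the same question and aggregate their final answers. One common aggregation rule is majority voting (shown below as a sketch; the selection strategy actually used may differ):

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate final answers from n independently sampled
    trajectories; ties are broken by earliest first appearance."""
    return Counter(answers).most_common(1)[0][0]
```

As n grows, the chance that the correct answer dominates the vote increases whenever the per-trajectory accuracy exceeds that of any single competing wrong answer, which is one way accuracy can scale with parallel sampling.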