Reinforcement Learning for LLMs · Knowledge Retrieval and RAG · LLM Reasoning and Architecture

Can knowledge graphs generate training data for search agents?

Exploring whether synthesizing questions from knowledge graph random walks with entity blurring can create the hard-to-find training data needed to teach deep search agents to reason and search effectively.

Note · 2026-02-23 · sourced from Knowledge Graphs

Deep search agents need training data featuring hard-to-find questions that require long-horizon reasoning and iterative search — but such data is naturally scarce on the internet. DeepDive addresses this by automatically synthesizing challenging questions from open knowledge graphs (KGs), exploiting three properties:

  1. Verifiability: KG entity-relation triples are inherently traceable and objective, ensuring answer correctness — unlike fully model-generated QA pairs
  2. Multi-hop structure: Random walks of varying lengths on the KG explicitly control reasoning depth, generating questions requiring multiple inference steps
  3. Reasoning controllability: Each entity node has multiple attributes (dates, names, locations) that can be selectively obscured, creating "blurry entities" that prevent shortcut solutions

The pipeline: perform random walks on the KG to extract long multi-hop paths → LLMs further obfuscate key cues → resulting QA pairs require models to iteratively reason, search, validate, and reflect before arriving at accurate answers. This creates questions that even domain experts would need hours to research.
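The pipeline can be sketched in miniature. This is an illustrative toy (the KG, the `blur_entity` heuristic, and the question template are all stand-ins; DeepDive uses open knowledge graphs and an LLM for the obfuscation step), but it shows how walk length controls reasoning depth and how blurring removes shortcut cues while the KG still supplies a verifiable answer:

```python
import random

# Toy KG: entity -> list of (relation, entity) triples. Stand-in for an
# open knowledge graph with rich entity attributes.
KG = {
    "Marie Curie": [("born_in", "Warsaw"), ("won", "Nobel Prize in Physics")],
    "Warsaw": [("capital_of", "Poland")],
    "Nobel Prize in Physics": [("first_awarded", "1901")],
    "Poland": [("joined_eu", "2004")],
}

def random_walk(kg, start, hops):
    """Extract a multi-hop path; `hops` controls reasoning depth."""
    path, node = [], start
    for _ in range(hops):
        if node not in kg or not kg[node]:
            break
        rel, nxt = random.choice(kg[node])
        path.append((node, rel, nxt))
        node = nxt
    return path

def blur_entity(name):
    """Obscure the start entity (a crude stand-in for the LLM
    obfuscation step), preventing direct-lookup shortcuts."""
    return f"a certain entity whose name starts with '{name[0]}'"

def synthesize_question(path):
    """Turn a path into a QA pair; the KG tail entity is the
    verifiable ground-truth answer."""
    head = blur_entity(path[0][0])
    chain = ", then ".join(f"follow '{rel}'" for _, rel, _ in path)
    answer = path[-1][2]
    return f"Starting from {head}, {chain}. What do you reach?", answer

random.seed(0)
question, answer = synthesize_question(random_walk(KG, "Marie Curie", 3))
```

Longer walks plus heavier blurring yield harder questions, while answer correctness never depends on a model's judgment, only on the KG.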

Combined with end-to-end multi-turn RL, DeepDive-32B achieves 14.8% accuracy on BrowseComp (a benchmark of hard-to-find information questions), establishing a new competitive result among open-source models and outperforming larger agents as well as several strong proprietary baselines. Key findings: complex supervision and multi-turn RL jointly ground tool use; performance scales with tool-call budgets and with parallel sampling; skills learned on hard problems transfer to simpler settings.
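The parallel-sampling axis of test-time scaling can be made concrete. A minimal sketch, assuming simple majority voting over k independent rollouts (the aggregation rule here is illustrative, not DeepDive's exact scheme):

```python
from collections import Counter

def aggregate_parallel_samples(answers):
    """Majority vote over final answers from k parallel agent rollouts.
    None marks a rollout that exhausted its tool-call budget."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy rollouts: with more samples, the true answer is more likely
# to win the vote even if any single rollout is unreliable.
rollouts = ["Paris", "Paris", "Lyon", None, "Paris"]
best = aggregate_parallel_samples(rollouts)
```

Accuracy improving with k is exactly the "scales with parallel sampling" finding; raising the per-rollout tool-call budget is the orthogonal sequential axis.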

The broader principle: KGs are ideal substrates for training data synthesis because they encode the relational complexity that makes questions genuinely hard, while providing the ground truth that makes answers verifiable. This is a concrete realization of the curriculum data thesis.

This connects to:

DeepDive (2025) adds end-to-end multi-turn RL on top of KG-synthesized data. Using multi-turn GRPO, where the LLM interacts with a web environment and receives a reward based on the final answer, DeepDive-32B achieves a new competitive open-source result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. The key finding: multi-turn RL training improves deep search ability and enables test-time scaling of tool calls — the model learns to invoke search more effectively and more frequently as it reasons. This validates the KG-based data synthesis approach by showing it provides sufficient training signal for RL-based deep search agents. Source: Arxiv/Agentic Research.
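The group-relative credit assignment at the heart of GRPO is simple to show. A minimal sketch, assuming a binary final-answer reward checked against the KG-verified ground truth (group size and reward shaping here are illustrative):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each trajectory's
    final-answer reward against its own sampling group, so no learned
    value model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]

# One group of 4 multi-turn rollouts for the same question; reward is 1.0
# if the final answer matched the KG ground truth, else 0.0. The advantage
# is applied to every token of the rollout, across all search/reason turns.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the reward is computed only from the final answer, the verifiability of KG-derived answers is what makes this training signal clean.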

Original note title

Knowledge graph random walks with entity blurring generate scalable hard-to-find training data for deep search agents