Deep Research: A Systematic Survey

Paper · arXiv 2512.02038 · Published November 24, 2025

Abstract: Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet many open-ended tasks demand critical thinking, multi-source evidence, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. Recently, numerous studies have explored Deep Research (DR), which aims to combine the reasoning capabilities of LLMs with external tools, such as search engines, thereby empowering LLMs to act as research agents capable of completing complex, open-ended tasks. This survey presents a comprehensive and systematic overview of deep research systems, including a clear roadmap, foundational components, practical implementation techniques, important challenges, and future directions. Specifically, our main contributions are as follows: (i) we formalize a three-stage roadmap and distinguish deep research from related paradigms; (ii) we introduce four key components: query planning, information acquisition, memory management, and answer generation, each paired with fine-grained sub-taxonomies; (iii) we summarize optimization techniques, including prompting, supervised fine-tuning, and agentic reinforcement learning; and (iv) we consolidate evaluation criteria and open challenges, aiming to guide and facilitate future development.

We propose a three-stage roadmap for DR systems, illustrating their broad applications ranging from agentic information seeking to autonomous scientific discovery. Based on this roadmap, we summarize the key components of the task-solving workflow for the most commonly used DR systems. Specifically, we present four foundational components of DR: (i) query planning, which decomposes the initial input query into a series of simpler sub-queries [250, 426]; (ii) information acquisition, which invokes external retrieval, web browsing, or various tools on demand [167, 221]; (iii) memory management, which maintains relevant task-solving context through controlled updating or folding [243]; and (iv) answer generation, which produces comprehensive outputs with explicit source attribution, e.g., a scientific report.

Phase II: Integrated Research. Phase II systems move beyond isolated facts to produce coherent, structured reports that integrate heterogeneous evidence while managing conflicts and uncertainty. The research loop becomes explicitly iterative: systems plan sub-questions, retrieve and extract key evidence from various raw content (e.g., HTML [323], tables [44, 226], and charts [208]), and ultimately synthesize comprehensive, narrative reports. The most common applications include market and competitive analysis [469, 347], policy briefs [356], itinerary design under constraints [331], and other long-horizon question answering [66, 434, 378, 49]. Accordingly, evaluation shifts from superficial short-form lexical matching to long-form quality, including fine-grained factuality [43, 216], verified citations [310, 86], structural coherence [21], and key-point coverage [379]. Phase II thus trades a modest increase in compute and complexity for substantial gains in clarity, coverage, and decision support.

3.1. Query Planning

Query Planning refers to the process of transforming a complex and logically intricate question into a structured sequence of executable sub-queries (a.k.a. sub-tasks), each of which can be addressed incrementally. This decomposition enables stepwise reasoning and knowledge acquisition, thereby enhancing the reliability and accuracy of the final output generated by deep research systems. Figure 3 shows three widely used strategies for query planning: (i) parallel planning, which decomposes the input into independent sub-queries that may be resolved in parallel [36, 59]; (ii) sequential planning, which arranges sub-queries into a linear order where each step depends on intermediate outcomes [286, 145]; and (iii) tree-based planning, which explores branching decision spaces and selects among candidate paths through pruning, backtracking, or heuristic-guided search [427].

3.1.1. Parallel Planning

Definition. As illustrated in Figure 3(a), parallel planning operates by rewriting or decomposing the original query into multiple sub-questions in a single pass, typically without iterative interaction with downstream components. The primary advantage of this strategy lies in its efficiency: simultaneous generation enables parallel processing of sub-queries.
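As a minimal sketch of this one-pass strategy: `decompose` and `answer` below are hypothetical stand-ins for an LLM planner and a retriever-reader pipeline, not any specific system's API; only the control flow is the point.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(query):
    # Hypothetical one-shot planner: in practice an LLM call that rewrites
    # the query into independent sub-questions in a single pass.
    return [f"{query} -- aspect {i}" for i in range(3)]

def answer(sub_query):
    # Stand-in for a retriever + reader pipeline resolving one sub-query.
    return f"evidence for: {sub_query}"

def parallel_plan(query):
    sub_queries = decompose(query)  # single pass, no downstream feedback
    with ThreadPoolExecutor() as pool:
        # Sub-queries are assumed independent, so resolve them concurrently.
        return list(pool.map(answer, sub_queries))

results = parallel_plan("impact of rate hikes on housing")
```

The efficiency gain comes entirely from the `pool.map` step: because decomposition happens once up front, no sub-query waits on another's result.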

Advantages & Disadvantages. Despite its efficiency, parallel planning has two primary limitations. First, it typically operates in a one-shot fashion, interacting with other modules (e.g., retriever, reasoner, aggregator) non-iteratively. As a result, it lacks mechanisms to incorporate intermediate evidence, correct earlier decisions, or adaptively allocate computational resources. Second, it often ignores data and logical dependencies across sub-queries. Parallel execution assumes conditional independence, yet many real-world queries involve sequential reasoning in which later subtasks depend on the resolution of earlier ones.

3.1.2. Sequential Planning

Definition. As illustrated in Figure 3(b), sequential planning decomposes the original query through multiple iterative steps, where each round of decomposition builds upon the outputs of previous rounds. At each stage, the planner may invoke different modules or external tools to process intermediate results, enabling a dynamic, feedback-driven reasoning process. This multi-turn interaction supports logically dependent query decompositions that are often intractable for one-shot parallel planning, which typically assumes conditional independence among sub-queries. By incorporating intermediate evidence and adapting the query trajectory accordingly, sequential planning is particularly well-suited for complex tasks that require stepwise inference, disambiguation, or progressive information gathering.
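The feedback loop can be sketched as follows; `plan_next` and `resolve` are illustrative stand-ins for an LLM planner conditioned on prior answers and a retrieval step, with a hard-coded three-step stopping rule purely for demonstration.

```python
def plan_next(query, history):
    # Hypothetical planner: each new sub-query conditions on the answers
    # gathered so far; returning None means the planner deems the task done.
    if len(history) >= 3:
        return None
    return f"step {len(history) + 1} of: {query}"

def resolve(sub_query):
    # Stand-in for retrieval + reasoning over one sub-query.
    return f"answer({sub_query})"

def sequential_plan(query):
    history = []  # list of (sub_query, answer) pairs
    while True:
        sub_query = plan_next(query, history)
        if sub_query is None:
            return history
        # Feed the outcome back so the next round can depend on it.
        history.append((sub_query, resolve(sub_query)))

trace = sequential_plan("who advised the advisor of X?")
```

Unlike the parallel case, each call to `plan_next` sees `history`, so later sub-queries can depend on earlier resolutions.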

3.1.3. Tree-based Planning

Definition. As illustrated in Figure 3(c), tree-based planning integrates features of both parallel and sequential planning by recursively treating each sub-query as a node within a structured search space, typically represented as a tree or a directed acyclic graph (DAG) [51]. This structure enables the use of advanced search algorithms, such as Monte Carlo Tree Search (MCTS) [20], to explore and refine potential reasoning paths. Compared to linear or flat decompositions, this approach supports more flexible and fine-grained decomposition of the original query, facilitating comprehensive knowledge acquisition.
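A full MCTS implementation is beyond a sketch, but the core idea of exploring a decomposition tree under a budget with heuristic-guided expansion can be shown with best-first search; the `expand` scoring rule and depth cap below are illustrative assumptions.

```python
import heapq

def expand(node):
    # Hypothetical expansion: propose child sub-queries with heuristic
    # scores (higher = more promising). Depth is encoded via '/' separators.
    depth = node.count("/")
    if depth >= 2:
        return []  # leaf: no further decomposition
    return [(1.0 / (depth + i + 1), f"{node}/child{i}") for i in range(2)]

def tree_plan(root, budget=5):
    # Best-first exploration: repeatedly pop the most promising node,
    # expand it, and push its children, until the node budget is spent.
    frontier = [(0.0, root)]  # (negated score, node); heapq is a min-heap
    visited = []
    while frontier and len(visited) < budget:
        _, node = heapq.heappop(frontier)
        visited.append(node)
        for score, child in expand(node):
            heapq.heappush(frontier, (-score, child))  # prefer high scores
    return visited

order = tree_plan("root-query")
```

Pruning corresponds to nodes that never leave the frontier before the budget runs out; backtracking is implicit, since the heap can return to a shallower branch whenever it scores higher than the current one.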

Text Retrieval. Conceptually, modern text retrieval can be organized into three families: (i) lexical retrieval, (ii) semantic retrieval, and (iii) commercial web search. Lexical and semantic retrieval are typically implemented on local resources, while commercial web search is typically accessed only via paid APIs.

Different from the lexical retrieval, semantic retrieval refers to dense neural methods that encode queries and documents into continuous vector spaces to capture semantic similarity beyond exact term matching [283, 111, 284, 144], which has been widely adopted in recent works [145, 289].
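To make the contrast concrete, the sketch below replaces a trained bi-encoder with a toy character-trigram hashing encoder (a loudly labeled stand-in, not a semantic model) so that the vector-space retrieval machinery itself — encode, cosine similarity, top-k — is runnable.

```python
import hashlib
import math

def encode(text, dim=512):
    # TOY stand-in for a dense neural encoder: hash character trigrams into
    # a fixed-width vector. A real system would use a trained bi-encoder
    # (e.g., a dual-tower transformer); only the interface matters here.
    vec = [0.0] * dim
    t = text.lower()
    for i in range(len(t) - 2):
        h = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dense_search(query, docs, k=2):
    # Rank documents by similarity in the shared vector space.
    q = encode(query)
    return sorted(docs, key=lambda d: cosine(q, encode(d)), reverse=True)[:k]

docs = ["deep learning for search", "cooking pasta at home",
        "neural retrieval models"]
top = dense_search("neural retrieval methods", docs, k=1)
```

Swapping `encode` for a real neural encoder turns this into standard dense retrieval; everything downstream of `encode` is unchanged.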

Confidence Estimation as a Proxy for Boundary Perception. There are extensive works that investigate LLMs’ perception of their knowledge boundaries. The degree to which a model perceives its boundaries is typically measured by the alignment between its confidence and factual correctness. Since factual correctness is typically evaluated by comparing the model’s generated answer with the ground-truth answer, existing studies focus on how to measure the model’s confidence, which can be broadly divided into four categories.

• Probabilistic Confidence. This line of work treats a model’s token-level generation probabilities as its confidence in the answer [104, 58, 137, 153, 295, 164, 69]. Prior to the emergence of LLMs, a line of work had already shown that neural networks tend to be poorly calibrated, often producing overconfident predictions even when incorrect [104, 58, 137]. More recently, some research [153, 295] reported that LLMs can be well calibrated on structured tasks such as multiple-choice question answering, or when given appropriate prompts, but for open-ended generation tasks, predicted probabilities still diverge from actual correctness. To address this gap, Duan et al. [69] proposed SAR, which computes confidence by focusing on important tokens, while Kuhn et al. [164] introduced semantic uncertainty, which estimates confidence from the consistency of outputs across multiple generations.

• Consistency-based Confidence. Since probabilistic confidence often fails to capture a model’s semantic certainty and is inapplicable to black-box models without accessible generation probabilities, recent works represent confidence via semantic consistency across multiple responses [78, 207, 164, 451, 60]. The key idea is that a confident model should generate highly consistent answers across runs. Fomicheva et al. [78] first measured consistency through lexical similarity, while later studies used NLI (i.e., natural language inference) models or LLMs to assess semantic consistency [207, 164]. To address the issue of consistent but incorrect answers, Zhang et al. [451] measure consistency across different models, as incorrect answers tend to vary between models, whereas correct ones align. Ding et al. [60] further extended this idea to multilingual settings.

• Confidence Estimation Based on Internal States. LLMs’ internal states have been shown to capture the factuality of their generated content [10, 309, 28, 364, 230, 229]. Azaria and Mitchell [10] first discovered that internal states can signal models’ judgment of textual factuality. Subsequent studies [309, 28] found that internal states after response generation reflect the factuality of self-produced answers. More recently, Wang et al. [364] and Ni et al. [230] demonstrated that factuality-related signals already exist in the pre-generation states, enabling the prediction of whether the output will be correct.

• Verbalized Confidence. Several studies explore enabling LLMs to express confidence in natural language, akin to humans, viewing such verbalization as a sign of intelligence [185, 431, 340, 409, 450, 424, 228]. Yin et al. [431] and Ni et al. [228] examined whether LLMs can identify unanswerable questions, finding partial ability but persistent overconfidence. Other works [340, 409] investigated fine-grained confidence expression. Xiong et al. [409] offered the first comprehensive study for black-box models, while Tian et al. [340] proposed generating multiple answers per pass for more accurate estimation. Beyond prompting, some methods explicitly train models to verbalize confidence [185, 424, 450], with Lin et al. [185] introducing this idea and using correctness-based supervision.
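The first two families above can be sketched together, assuming the model exposes token log-probabilities and can be sampled multiple times; all numeric values below are illustrative, and exact string matching stands in for the NLI- or LLM-based semantic clustering used in practice.

```python
import math
from collections import Counter

def probabilistic_confidence(token_logprobs):
    # Length-normalized sequence probability: exp of the mean token
    # log-probability. Importance weighting of key tokens (as in SAR)
    # is omitted for brevity.
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def consistency_confidence(samples):
    # Agreement across sampled answers as a proxy for confidence.
    # Real systems cluster semantically equivalent answers with NLI
    # models or an LLM judge; exact matching is a simplification.
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)

# Hypothetical log-probabilities for a sharply peaked generation.
p_conf = probabilistic_confidence([-0.05, -0.10, -0.02])
ans, c_conf = consistency_confidence(["Paris", "Paris", "Lyon", "Paris"])
```

A well-calibrated model is one where these scores track factual correctness; the cited works differ mainly in how they close the gap when they do not.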

Representative Adaptive Retrieval Approaches. Deep research systems typically involve iterative interactions between model inference and external document retrieval, differing mainly in how they determine when to retrieve. Early works such as IR-CoT [343] enforce retrieval after every reasoning step, ensuring continual grounding in external knowledge but at the cost of efficiency. Building on insights from studies of models’ perceptions of their own knowledge boundaries, recent approaches treat retrieval as a model-issued action, enabling the model to perform it dynamically only when needed. Similar to techniques in confidence estimation, these methods assess whether the model can answer a question correctly given the current context and perform retrieval when knowledge is deemed insufficient. They can be broadly categorized into four paradigms.

• Probabilistic Strategy. It triggers retrieval based on token-generation probabilities: when the model produces a token with low confidence, retrieval is initiated [138, 308].

• Consistency-based Strategy. Recognizing that both token-level probabilities and single-model self-consistency may fail to capture true semantic uncertainty, Rowen [60] evaluates consistency across responses generated by multiple models and languages, triggering retrieval when cross-model or cross-lingual agreement is low.

• Internal States Probing. CtrlA [126], UAR [40], and SEAKR [429] further propose that compared to generated responses, a model’s internal states provide a more faithful reflection of its confidence, using them to guide adaptive retrieval decisions.

• Verbalized Strategy. It enables the model to directly express its confidence via natural language. These methods typically generate special tokens directly in the response to indicate the need for retrieval. ReAct [428] directly prompts the model to generate corresponding action text when retrieval is needed. Self-RAG [9] trains the model to explicitly express uncertainty through a dedicated special token, signaling the need for retrieval. With LLMs’ growing reasoning capacity, recent research has shifted toward determining retrieval timing through reasoning and reflection. Search-o1 [182] introduces a Reason-in-Documents module, which prompts the model to selectively invoke search during reasoning. Search-R1 [145] further frames retrieval as part of the environment and employs reinforcement learning to jointly optimize both when and what to retrieve.
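Two of the paradigms above admit compact sketches: a probabilistic trigger over token probabilities, and a ReAct-style loop in which a verbalized `Search[...]` span in the model's output is parsed as a retrieval action. The `Search[...]` convention, the threshold value, and the scripted model are illustrative assumptions, not a faithful reproduction of any cited system.

```python
import math
import re

def should_retrieve(token_logprobs, threshold=0.5):
    # Probabilistic strategy: retrieve when any generated token's
    # probability falls below a threshold (local uncertainty).
    return any(math.exp(lp) < threshold for lp in token_logprobs)

def run_agent(model_step, question, search, max_turns=4):
    # Verbalized strategy: the model emits free text; a Search[...] span
    # is treated as a retrieval request, executed, and fed back as an
    # observation for the next reasoning turn.
    context = question
    for _ in range(max_turns):
        output = model_step(context)
        match = re.search(r"Search\[(.+?)\]", output)
        if match is None:
            return output  # no retrieval requested: treat as final answer
        context += f"\nObservation: {search(match.group(1))}"
    return context

# Scripted stand-in for an LLM: asks for one search, then answers.
def scripted_model(context):
    if "Observation:" not in context:
        return "Thought: I need facts. Search[capital of France]"
    return "Answer: Paris"

result = run_agent(scripted_model, "What is the capital of France?",
                   lambda q: f"stub result for {q}")
```

Replacing `scripted_model` with a real LLM call and `search` with a retriever yields the basic agentic retrieval loop that RL-based systems such as Search-R1 then optimize end to end.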

Document Selection. Document selection aims to rank a set of candidate documents based on their relevance and usefulness to the query, selecting the top-k helpful documents for question answering [410, 439, 381]. This selection operation reduces the impact of noisy documents on LLMs, improving the question-answering accuracy in downstream tasks. Below, we review three document selection strategies: point-wise selection, pair-wise selection, and list-wise selection.
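The point-wise variant is the simplest to sketch: score each (query, document) pair independently and keep the top-k. The `overlap` scorer below is a deliberately crude stand-in for a cross-encoder or LLM judge.

```python
def select_documents(query, docs, score, k=3):
    # Point-wise selection: each document is scored independently against
    # the query; pair-wise and list-wise variants instead compare documents
    # against each other.
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]

def overlap(q, d):
    # Toy relevance scorer: shared word count (stand-in for a reranker).
    return len(set(q.lower().split()) & set(d.lower().split()))

docs = ["solar panel efficiency trends", "history of jazz music",
        "panel efficiency benchmarks for solar cells"]
top = select_documents("solar panel efficiency", docs, overlap, k=2)
```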

Content Compression. Content Compression aims to remove redundant or irrelevant information from retrieved knowledge, thereby increasing the density of useful content within the model’s context. Existing approaches primarily fall into two categories: lexical-based and embedding-based methods.
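A minimal lexical-based compressor can be sketched as sentence filtering by query-term overlap; the sentence splitter, scoring rule, and keep ratio are simplifying assumptions (embedding-based variants would score sentences with a dense encoder instead).

```python
def compress(query, passage, keep_ratio=0.5):
    # Lexical compression: keep the sentences with the highest query-term
    # overlap, dropping the rest to raise useful-content density.
    q_terms = set(query.lower().split())
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    scored = sorted(sentences,
                    key=lambda s: len(q_terms & set(s.lower().split())),
                    reverse=True)
    keep = max(1, int(len(sentences) * keep_ratio))
    kept = set(scored[:keep])
    # Preserve the original sentence order in the compressed output.
    return ". ".join(s for s in sentences if s in kept) + "."

out = compress("battery lifespan",
               "Batteries degrade over time. The weather was sunny. "
               "Charging habits affect battery lifespan")
```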

3.3. Memory Management

Definition. Memory management is a foundational component of advanced DR architectures, which governs the dynamic lifecycle of context used by DR agents in complex, long-horizon tasks [398, 67, 136], aiming to maintain coherent and relevant task-solving context [113, 462, 319].

Core Operation. As illustrated in Figure 5, memory management typically involves four core operations: consolidation, indexing, updating, and forgetting. Consolidation converts short-term experiences into durable representations that form the basis for later indexing. Indexing organizes these representations into retrieval structures that support efficient recall during problem solving. Updating refines or corrects stored knowledge, whereas forgetting selectively removes outdated or irrelevant content to reduce interference. In the following sections, we discuss consolidation, indexing, updating, and forgetting in detail.
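The four operations can be sketched as a minimal interface; the dict-based stores and keyword index below are illustrative assumptions, not any particular system's design.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Toy memory store exposing the four core operations."""
    long_term: dict = field(default_factory=dict)  # key -> durable text
    index: dict = field(default_factory=dict)      # keyword -> memory keys

    def consolidate(self, key, events):
        # Consolidation: distill transient events into one durable entry.
        self.long_term[key] = " | ".join(events)

    def build_index(self):
        # Indexing: map each keyword to the memories that mention it,
        # enabling efficient recall during problem solving.
        self.index = {}
        for key, text in self.long_term.items():
            for word in text.lower().split():
                self.index.setdefault(word, set()).add(key)

    def update(self, key, text):
        # Updating: overwrite stale knowledge, then re-index.
        self.long_term[key] = text
        self.build_index()

    def forget(self, key):
        # Forgetting: remove outdated content to reduce interference.
        self.long_term.pop(key, None)
        self.build_index()

m = Memory()
m.consolidate("day1", ["met Alice", "discussed budget"])
m.build_index()
```

The ordering mirrors the lifecycle in the text: consolidation produces the entries that indexing organizes, while updating and forgetting act on both.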

3.3.1. Memory Consolidation

Definition. Memory consolidation is the process of transforming transient, short-term information, such as user dialogues or tool execution outputs, into stable, long-term representations [303, 67, 398]. Drawing an analogy to cognitive neuroscience, this process encodes and abstracts raw inputs to create durable memory engrams, laying the groundwork for efficient long-term storage and retrieval [398]. Memory consolidation involves transforming interaction histories into durable formats, including but not limited to model parameters [370], structured graphs [474], or knowledge bases [197, 67]. Distinct from memory indexing, which creates navigable access pathways over existing memories, consolidation is fundamentally concerned with the initial transformation and structural organization of raw experience. Two primary paradigms for this process have emerged: (i) unstructured memory consolidation and (ii) structured memory consolidation.

Unstructured Memory Consolidation. This paradigm distills lengthy interaction histories or raw texts into high-level, concise summaries or key event logs. For example, MemoryBank [482] processes and distills conversations into a high-level summary of daily events, which helps in constructing a long-term user profile. Similarly, MemoChat [197] summarizes conversation segments by abstracting the main topics discussed, while ChatGPT-RSum [358] adopts a recursive summarization strategy to manage extended conversations. Other approaches focus on abstracting experiences; Generative Agents [245] utilize a reflection mechanism triggered by sufficient event accumulation to generate more abstract thoughts as new, consolidated memories. To create generalizable plans, GITM [501] summarizes key actions from multiple successful plans into a common reference memory.

Structured Memory Consolidation. This paradigm transforms unstructured information into highly organized formats such as databases, graphs, or trees. This structural encoding is the primary act of consolidation, designed to capture complex inter-entity relationships and create an organized memory corpus. For instance, TiM [187] extracts entity relationships from raw information and stores them as tuples in a structured database. ChatDB [119] leverages a database as a form of symbolic memory, transforming raw inputs into a queryable, relational format. AriGraph [6] implements a memory graph where knowledge is represented as vertices and their interconnections as edges. Similarly, HippoRAG [142] constructs knowledge graphs over entities, phrases, and summaries to form an interconnected web of fragmented knowledge units. MemTree [268] builds and updates a tree structure by traversing from the root and deciding whether to deepen the tree with new information or create new leaf nodes based on semantic similarity. This hierarchical organization is the core of its consolidation strategy, enabling structured storage of memories.

3.3.2. Memory Indexing

Definition. Memory indexing involves constructing a navigational map over a DR agent’s consolidated memories, analogous to a library’s catalog or a book’s index for efficient information retrieval [204]. Unlike memory consolidation, which focuses on the initial transformation of raw data into a durable format, indexing operates on already consolidated memories to create efficient, semantically rich retrieval pathways. This process builds auxiliary access structures that enhance retrieval not only in efficiency but also in relevance.

• Signal-enhanced Indexing. This paradigm augments consolidated memory entries with auxiliary metadata, including emotional context, topics, and keywords, which function as granular pivots for context-aware retrieval [312, 448]. For instance, LongMemEval [390] enhances memory keys by integrating temporal and semantic signals to improve retrieval precision. Similarly, the Multiple Memory System (MMS) [448] decomposes experiences into discrete components, such as cognitive perspectives and semantic facts, thereby facilitating multifaceted retrieval strategies.

• Graph-based Indexing. This paradigm leverages a graph structure, where memories are nodes and their relationships are edges, as a sophisticated index. By representing memory networks in this way, agents can perform complex multi-hop reasoning by traversing chains of connections to locate information that is not explicitly linked to the initial query [46, 194]. For instance, HippoRAG [142] uses lightweight knowledge graphs to explicitly model inter-memory relations, enabling structured, interpretable access. A-Mem [414] adopts a dynamic strategy where the agent autonomously links related memory notes, progressively growing a flexible access network.

• Timeline-based Indexing. This paradigm creates a temporal index by organizing memory entries along chronological or causal sequences. Such structuring provides a historical access pathway, which is essential for understanding progression, maintaining conversational coherence, and supporting lifelong learning [353].
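The signal-enhanced and timeline-based paradigms can be sketched over the same memory entries; the metadata fields and records below are illustrative assumptions, not drawn from any cited system.

```python
from datetime import date

# Each consolidated memory carries auxiliary signals (topic, timestamp)
# that serve as retrieval pivots alongside the raw text.
memories = [
    {"id": "m1", "text": "user prefers morning meetings",
     "topic": "scheduling", "time": date(2024, 1, 5)},
    {"id": "m2", "text": "project deadline moved to March",
     "topic": "project", "time": date(2024, 2, 10)},
    {"id": "m3", "text": "user switched team to infra",
     "topic": "project", "time": date(2024, 3, 1)},
]

def by_signal(topic):
    # Signal-enhanced lookup: filter on metadata rather than full text.
    return [m["id"] for m in memories if m["topic"] == topic]

def timeline():
    # Timeline index: chronological ordering supports questions about
    # progression ("what changed, and when?").
    return [m["id"] for m in sorted(memories, key=lambda m: m["time"])]
```

Graph-based indexing would additionally link entries (e.g., m2 and m3 via a shared project entity) so that multi-hop traversal can reach memories not matched by the initial query.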

3.3.3. Memory Updating

Definition. Memory updating is a core capability of DR agents, involving the reactivation and modification of existing knowledge in response to new information or environmental feedback [361, 321, 369]. This process is essential for maintaining the consistency, accuracy, and relevance of the agent’s internal world model, thereby enabling continual learning and adaptive behavior in dynamic environments [349, 357].

3.4.1. Integrating Upstream Information

Definition. The main principle of trustworthy answer generation is to ensure that every statement is grounded in verifiable external evidence. Thus, the first stage of answer generation is integrating information from its upstream components, including: the sub-queries from the query planning, the ranked and potentially conflicting evidence, and the evolving contextual state stored in memory.

Resolving Conflicting Evidence. Research queries frequently surface contradictory sources, requiring the model to discriminate among varying levels of reliability. Building on fact-verification paradigms [339], recent systems adopt three major strategies.

• Credibility-Aware Attention: Instead of treating all retrieved information equally, this approach intelligently weighs evidence based on its source. The system assigns a higher score to information coming from more credible sources (e.g., a top-tier scientific journal) compared to less reliable ones (e.g., an unverified blog) [56]. This allows the model to prioritize trustworthy information while still considering relevant insights from a wider range of sources [94].

• Multi-Agent Deliberation: This strategy simulates an expert committee meeting to debate the evidence. Frameworks like MADAM-RAG [350] employ multiple independent AI agents, each tasked with analyzing the retrieved documents from a different perspective. Each agent forms its own assessment and conclusion. Afterwards, a final meta-reasoning step synthesizes these diverse viewpoints to forge a more robust and nuanced final answer, much like a panel of experts reaching a consensus [351].

• Reinforcement Learning for Factuality: This method trains the generator through a trial-and-error process that rewards factual accuracy [313]. A representative approach is RioRAG [372], in which an LLM receives a positive reward when it generates statements that are strongly and consistently supported by the provided evidence. Conversely, it is penalized for making unsubstantiated claims or statements that contradict the source material, shaping the model to inherently prefer generating factually grounded and reliable answers.
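The credibility-aware idea reduces, at its simplest, to weighting each piece of evidence by source reliability before aggregating; the scores, stances, and summation rule below are illustrative assumptions (real systems fold such weights into attention rather than a scalar sum).

```python
def weighted_verdict(evidence):
    # Each item is (claim, source_credibility, stance), where stance is
    # +1 if the source supports the candidate answer and -1 if it refutes
    # it. Credibility scores are assumed to come from an upstream rater.
    score = sum(cred * stance for _, cred, stance in evidence)
    return "supported" if score > 0 else "refuted"

verdict = weighted_verdict([
    ("peer-reviewed study confirms", 0.9, +1),
    ("unverified blog post denies", 0.2, -1),
    ("news wire confirms", 0.6, +1),
])
```

Here the single dissenting blog post is outweighed by two more credible supporting sources, which is exactly the behavior the attention-based and multi-agent variants aim for at scale.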

3.4.3. Structuring Reasoning and Narrative

The research community’s focus is shifting from the mere factual accuracy of DR systems to the crucial need for explainability and logical rigor in their answers. An opaque answer, which prevents users from tracing the underlying reasoning process, has significantly diminished utility in critical domains like scientific research [116, 201, 283]. Consequently, a significant line of work has emerged to enable models to generate structured reasoning processes rather than just monolithic final answers [376, 484, 99]. This trend is reflected in the design of most modern deep research systems, which increasingly favor the explicit presentation of this structural information [418, 486].

Anthropic proposes a multi-agent Deep Research (DR) framework where a lead orchestrator coordinates multiple worker agents through structured, auditable interactions. The system transforms an open-ended research query into a complete workflow, from planning and delegation to synthesis and citation, under an explicit research budget controlling agent count, tool usage, and reasoning depth. We highlight several core points that enable the system’s efficiency and reliability:

• Query Stratification and Planning. The orchestrator first analyzes the semantic type and difficulty of the input query (e.g., depth-first vs. breadth-first) to determine research strategy and allocate a corresponding budget of agents, tool calls, and synthesis passes.

• Delegation and Scaling. Effort scales with complexity: from 1–2 agents for factual lookups to 10 or more for multi-perspective analyses, each assigned clear quotas and stopping criteria to enable dynamic budget reallocation.

• Task Decomposition and Prompt Specification. The main query is decomposed into modular subtasks, each encoded as a structured prompt specifying objectives, output schema, citation policy, and fallback actions to ensure autonomy with accountability.

• Tool Selection and Evidence Logging. A central tool registry (e.g., web fetch, PDF parsing, calculators) is used following freshness, verifiability, and latency rules. Agents record all tool provenance in an evidence ledger for traceable attribution.

• Parallel Gathering and Interim Synthesis. Worker agents operate concurrently while the orchestrator monitors coverage, resolves conflicts, and launches micro-delegations to close residual gaps or trigger deeper reasoning where needed.

• Final Report and Attribution. The orchestrator integrates verified findings into a coherent report, programmatically linking claims to sources and ensuring schema compliance, factual grounding, and transparent citation.

Overall, Anthropic’s system exemplifies a scalable, interpretable multi-agent research paradigm that achieves high-quality synthesis through modular delegation, explicit budgeting, and verifiable reasoning.
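The budgeting behavior described above can be sketched as a simple allocation rule; the query types, agent counts, and quotas below are illustrative assumptions inspired by the description, not Anthropic's actual policy.

```python
def allocate_budget(query_type, difficulty):
    # Illustrative orchestrator budgeting: effort scales with complexity.
    # difficulty is an assumed 1-5 rating produced by query stratification.
    if query_type == "factual" and difficulty <= 2:
        agents = 1  # simple lookups need a single worker
    elif query_type == "breadth-first":
        agents = min(10, 2 + 2 * difficulty)  # many parallel perspectives
    else:  # depth-first
        agents = min(6, 1 + difficulty)       # fewer, deeper chains
    return {
        "agents": agents,
        "tool_calls_per_agent": 5 * difficulty,  # per-agent quota
        "synthesis_passes": 1 if agents == 1 else 2,
    }

plan = allocate_budget("breadth-first", difficulty=3)
```

The quotas returned here are what each worker's structured prompt would encode as its stopping criteria, enabling the reallocation the bullet points describe.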

Since manual collection of expert trajectories is labor-intensive, costly, and difficult to scale, a key challenge lies in automatically constructing high-quality SFT datasets. This has been widely explored by prior work [367, 483, 336, 48]. Below, we categorize representative work into two main paradigms: (i) strong-to-weak distillation, which distills correct task-solving trajectories from powerful LLMs (e.g., GPT-5 and DeepSeek-V3.1) into smaller, weaker models; and (ii) iterative self-evolution, which iteratively fine-tunes the model on data produced by itself, yielding progressive improvement.