Characterizing Deep Research: A Benchmark and Formal Definition

Paper · arXiv 2508.04183 · Published August 6, 2025
Tags: Deep Research · Reasoning · o1 · o3 · Search · Knowledge Graphs · Domain Specialization

Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of deep research — a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose LIVEDRBENCH, a diverse and challenging benchmark of 100 tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public interest events (e.g., flight incidents, movie awards).

For example, “provide me a list of all Oscar-winning movies that were adapted from books with women authors” seems to be a hard, research-oriented task, but may not qualify as DR if there’s a webpage providing exactly this information.

We propose that the DR task can be split into two subtasks: 1) synthesizing claims that collectively answer the user’s query, given a document corpus; and 2) writing a report based on these claims. By a claim, we refer to any piece of information that is relevant for answering a user’s query, where each claim may consist of sub-claims that support or provide evidence for that information. Rather than the common expectation of a long report-like output, we posit that the defining element of a DR problem is the first subtask: synthesis of relevant information that could answer a user’s question, given a document corpus. The information synthesis problem can be conceptualized as a directed acyclic graph (DAG), where the nodes represent information such as the original query, retrieved documents, and synthesized information, while the edges represent actions such as issuing search queries and reasoning over retrieved documents (see Figure 1(b)), as in the sketch below.
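The following is a minimal illustrative sketch (not code from the paper) of this search-and-reasoning DAG: nodes hold information (the original query, retrieved documents, synthesized claims), and edges record the action (search or reasoning) that produced the target node. All class names, field names, and the example contents are assumptions made for illustration.

```python
# Illustrative DAG for the information-synthesis subtask.
from dataclasses import dataclass, field
from typing import Dict, List, Literal, Tuple

NodeKind = Literal["query", "document", "claim"]
Action = Literal["search", "reason"]

@dataclass
class Node:
    node_id: str
    kind: NodeKind      # what kind of information the node holds
    content: str

@dataclass
class SynthesisDAG:
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Each edge (source_id, target_id, action) records which action derived the target.
    edges: List[Tuple[str, str, Action]] = field(default_factory=list)

    def add_node(self, node: Node) -> Node:
        self.nodes[node.node_id] = node
        return node

    def add_edge(self, source_id: str, target_id: str, action: Action) -> None:
        self.edges.append((source_id, target_id, action))

# Example (hypothetical): a query retrieves two documents; reasoning over both yields one claim.
dag = SynthesisDAG()
q = dag.add_node(Node("q0", "query", "Oscar-winning movies adapted from books by women authors"))
d1 = dag.add_node(Node("d1", "document", "List of Best Picture winners"))
d2 = dag.add_node(Node("d2", "document", "Source novel and author for a given winner"))
c1 = dag.add_node(Node("c1", "claim", "<movie X> was adapted from a novel by <author Y>"))
dag.add_edge(q.node_id, d1.node_id, "search")
dag.add_edge(q.node_id, d2.node_id, "search")
dag.add_edge(d1.node_id, c1.node_id, "reason")
dag.add_edge(d2.node_id, c1.node_id, "reason")
```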

Consequently, we can characterize a task by the structure of its search and reasoning DAG from the perspective of a human expert. Given a document corpus, we call a query a deep research query if 1) the number of information units to be processed for obtaining the final answer is high (search intensity); and 2) at least one of the steps—finding these information units, processing them, or combining them to form the final claims—requires non-trivial reasoning (reasoning intensity). An information unit roughly corresponds to an atomic piece of information, such as a paragraph or a chunk in a retrieval corpus. While quantifying search or reasoning intensity is subjective, we posit that DR corresponds to any query that takes an ideal human expert more than 10 minutes to answer.
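This working definition can be read as a simple predicate, sketched below under assumed thresholds. The 10-minute expert-time proxy comes from the text above; the unit-count threshold, field names, and the idea of reducing the definition to a boolean function are illustrative assumptions, not part of the paper.

```python
# Hedged sketch of the deep-research predicate: high search intensity AND
# non-trivial reasoning somewhere in the find/process/combine pipeline.
from dataclasses import dataclass

@dataclass
class QueryProfile:
    num_information_units: int            # atomic chunks an expert must inspect
    requires_nontrivial_reasoning: bool   # true if any step needs non-trivial reasoning
    expert_minutes: float                 # estimated time for an ideal human expert

def is_deep_research(profile: QueryProfile,
                     min_units: int = 20,        # assumed search-intensity threshold
                     min_minutes: float = 10.0   # the paper's expert-time heuristic
                     ) -> bool:
    search_intensive = profile.num_information_units >= min_units
    return (search_intensive
            and profile.requires_nontrivial_reasoning
            and profile.expert_minutes > min_minutes)
```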

Given the complexity of the output, evaluating a DR model is difficult. In particular, assessing the quality of a detailed report is an inherently subjective task, involving evaluations of both style and substance. For an objective evaluation, we evaluate a DR output only on the substantive claims that it generates. The second subtask of generating a coherent and detailed report can be considered an auxiliary long-form generation task (He et al., 2025) and evaluated using standard metrics for long-form generation. Formally, given a retrieval corpus, we define the solution to a DR problem in terms of an intermediate output representation that consists of a list of claims answering the user’s query. A DR problem is then defined as a tuple ⟨query, list of claims⟩, where an ideal solution gets all the claims and their (recursive) subclaims correct. Since users typically evaluate both a claim and its supporting subclaims when assessing the validity of a DR report, we define modified versions of the Precision and Recall metrics, where the score for a claim is zero if the claim itself is incorrect, and also if all of its subclaims are incorrect.
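A minimal sketch of this scoring rule is shown below. The exact metric computation in LIVEDRBENCH may differ; here a claim scores 1 only if it is itself correct and, when it has subclaims, at least one subclaim also scores 1 (applied recursively). How predicted claims are matched against reference claims is abstracted into the `correct` flag, which is an assumed simplification.

```python
# Sketch of the modified Precision/Recall over claims and recursive subclaims.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    text: str
    correct: bool                          # whether the claim matches the reference
    subclaims: List["Claim"] = field(default_factory=list)

def claim_score(claim: Claim) -> int:
    if not claim.correct:
        return 0
    if claim.subclaims and all(claim_score(s) == 0 for s in claim.subclaims):
        return 0  # the claim is unsupported: every subclaim is wrong
    return 1

def precision(predicted: List[Claim]) -> float:
    # Fraction of generated claims that are correct and supported.
    return sum(claim_score(c) for c in predicted) / max(len(predicted), 1)

def recall(predicted: List[Claim], num_reference_claims: int) -> float:
    # Fraction of reference claims recovered (assuming one prediction per reference claim).
    return sum(claim_score(c) for c in predicted) / max(num_reference_claims, 1)
```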