How Far Are We from Genuinely Useful Deep Research Agents?

Paper · arXiv 2512.01948 · Published December 1, 2025
Agentic Research · Deep Research · Flaws

Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from insufficient task complexity and subjective metrics, which fail to reflect user demands and limit the practical utility of generated reports. To address these gaps, we present Finegrained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose the Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built on grounded theory with human–LLM co-annotation and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
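To make the checklist-based evaluation concrete, here is a minimal sketch of how a FINDER-style task with structured checklist items might be represented; the class and field names are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema for a FINDER-style task; names are illustrative
# assumptions, not the benchmark's released format.
@dataclass
class ChecklistItem:
    dimension: str        # e.g. "structure", "analytical depth", "factual grounding"
    criterion: str        # concrete requirement the report is judged against
    satisfied: bool = False

@dataclass
class ResearchTask:
    task_id: str
    prompt: str
    checklist: list[ChecklistItem] = field(default_factory=list)

    def score(self) -> float:
        """Fraction of checklist items the generated report satisfies."""
        if not self.checklist:
            return 0.0
        return sum(item.satisfied for item in self.checklist) / len(self.checklist)
```

Under a representation like this, the benchmark's 100 tasks and 419 checklist items average roughly four items per task.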

However, despite their promising application potential, DRAs still fall short of expectations in real-world report generation tasks [6–10]. Existing benchmarks are mostly tailored for question-answering (QA) [11–14] or other closed-ended tasks [15], and fail to fully capture the nuances and strict requirements of practical deep research scenarios, where higher standards are imposed on the quality, accuracy, depth, and logical coherence of generated reports. Although a considerable number of open-ended benchmarks exist [6–9, 16], their tasks often stem from LLM-driven sampling or synthesis, leading to deviations from real human demands and insufficient task complexity.

Our experimental evaluation of various DRAs on FINDER and DEFT, including proprietary systems [1–3], open-source models [19–24], and agent frameworks [21, 25–32], reveals several key insights. While systems like Gemini perform well across general benchmarks, our analysis shows that over 39% of failures arise in content generation, particularly through strategic content fabrication, where agents tend to generate unsupported but seemingly professional content. Furthermore, retrieval-related failures, such as insufficient evidence integration and fact-checking issues, account for over 32% of errors, highlighting the challenges DRAs face in managing and verifying the quality of retrieved information. These results underscore that the core challenges for DRAs are not limited to simple task comprehension but instead involve deeper issues in evidence verification and reasoning resilience.

DEFT failure taxonomy (14 failure modes across three core categories):

• Reasoning: Failure to Understand Requirements (FUR), Lack of Analytical Depth (LAD), Limited Analytical Scope (LAS), Rigid Planning Strategy (RPS)

• Retrieval: Insufficient External Information Acquisition (IIA), Information Representation Misalignment (IRM), Information Handling Deficiency (IHD), Information Integration Failure (IIF), Verification Mechanism Failure (VMF)

• Generation: Redundant Content Piling (RCP), Structural Organization Dysfunction (SOD), Content Specification Deviation (CSD), Deficient Analytical Rigor (DAR), Strategic Content Fabrication (SCF)
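For reference, the taxonomy above can be encoded as a simple mapping, which also makes it straightforward to reproduce category-level failure shares like the percentages quoted earlier; the sample annotations below are hypothetical.

```python
from collections import Counter

# DEFT's 14 failure modes grouped under the three core categories.
DEFT = {
    "Reasoning": ["FUR", "LAD", "LAS", "RPS"],
    "Retrieval": ["IIA", "IRM", "IHD", "IIF", "VMF"],
    "Generation": ["RCP", "SOD", "CSD", "DAR", "SCF"],
}
MODE_TO_CATEGORY = {m: cat for cat, modes in DEFT.items() for m in modes}

def category_shares(annotations: list[str]) -> dict[str, float]:
    """Percentage of annotated failures falling in each core category."""
    counts = Counter(MODE_TO_CATEGORY[m] for m in annotations)
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.items()}

# Hypothetical annotations, one failure-mode code per observed failure.
print(category_shares(["SCF", "IIF", "VMF", "SCF", "RCP", "IIA"]))
```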

3.2.2 Axial Coding

Axial coding employs both deductive and inductive reasoning to explore relationships among concepts based on semantics, context, process, causality, function, structure, and strategy [41]. Through merging, splitting, removing, or modifying these relationships, it forms axial categories. At this stage, we conducted three rounds of coding based on inter-coder reliability (ICR) assessments: the first round utilized open coding results from Group A (Table F.1), while the second and third rounds incorporated all open coding results alongside the first-round axial coding outcomes. ICR measures the consistency among coders when coding the same data [42] and has been demonstrated to consolidate [43, 44] or validate [45] existing coding frameworks.
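For concreteness, ICR between two coders can be quantified with a statistic such as Cohen's kappa; the sketch below is a standard textbook implementation and does not presume which ICR measure the authors actually used.

```python
from collections import Counter

def cohens_kappa(codes_a: list[str], codes_b: list[str]) -> float:
    """Cohen's kappa for two coders labeling the same items.
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is agreement expected by chance from marginal label rates."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Example: two coders assigning DEFT failure modes to the same records.
print(cohens_kappa(["SCF", "IIF", "SCF", "VMF"], ["SCF", "IIF", "RCP", "VMF"]))
```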

3.2.3 Selective Coding

Selective coding synthesizes the concepts and categories developed in the first two coding stages to establish overarching core categories. It clarifies their interrelationships and connects them through systematic logical threads [17]. At this stage, we repeatedly analyzed the axial categories derived from axial coding, ultimately distilling three core categories: Reasoning, Retrieval, and Generation. Functionally, these three core categories form a complete closed loop for agent task execution. Temporally, they are interwoven and sequentially progressive, collectively underpinning a systematic understanding of agent failure mechanisms.

We randomly selected 36 execution records (six each from the Chinese and English parts) generated by two agents not involved in the taxonomy construction stage, WebThinker and OpenManus, for coding analysis. No new categories emerged during this process, indicating that our categorization system had achieved theoretical saturation and demonstrated the explanatory power and stability required by grounded theory [48].
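This saturation check can be made mechanical: replay held-out records and test whether any record requires a failure mode outside the existing taxonomy. The sketch below is illustrative, and the held-out annotations are hypothetical.

```python
def check_saturation(records: list[set[str]], taxonomy: set[str]) -> bool:
    """Return True if no held-out record introduces a category
    outside the existing taxonomy (theoretical saturation)."""
    for i, codes in enumerate(records, start=1):
        new = codes - taxonomy
        if new:
            print(f"record {i} introduced new categories: {new}")
            return False
    return True

# Hypothetical held-out annotations from WebThinker / OpenManus runs.
held_out = [{"SCF", "VMF"}, {"IIF"}, {"RCP", "DAR"}]
deft = {"FUR", "LAD", "LAS", "RPS", "IIA", "IRM", "IHD",
        "IIF", "VMF", "RCP", "SOD", "CSD", "DAR", "SCF"}
print(check_saturation(held_out, deft))  # True -> saturation holds
```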

Root Cause Analysis

  1. Root Cause of Factual Hallucination:

• Generative Extrapolation to Fulfill “Depth”: The prompt’s demand for a “deeply,” “systematically,” and “comprehensively” analyzed report likely pushed the model beyond its knowledge base. To create a more dynamic and seemingly insightful narrative about the “long-term sustainability considerations” of the plugin ecosystem, the model extrapolated a known pattern—that community plugins can become abandoned—and fabricated specific, future-dated examples (“archived in 2025”). This is a pathological attempt to demonstrate “deep understanding” by creating a story where none exists.

• Concept Blending and Plausible Invention: The creation of “Obsidian Bases” is likely a result of the model blending community discussions and desires for a native database solution in Obsidian. It synthesized a plausible name (“Bases”) and status (“official core plugin”) to satisfy the prompt’s request to analyze the ecosystem’s core components. This demonstrates a failure to distinguish between community speculation and official product roadmaps.

  2. Root Cause of Misrepresented Evidentiary Basis:

• Mimicry without Substance: This is a classic LLM failure mode. The agent correctly identified the linguistic style and structure of a software evaluation report. It understands that such reports contain sections based on user interviews and quantitative benchmarks. However, lacking the ability to conduct such research, it defaults to its core function: generating text that mimics the expected output. It interprets the instruction “Include user interviews” as “Write in a style that suggests user interviews were conducted.” This reveals a fundamental gap between understanding a request’s semantics and possessing the capability to execute it.

  3. Root Cause of Formatting and Structural Failures:

• Path of Least Resistance for Text Generation: Language models are optimized to generate fluid, sequential prose. Creating structured data like a detailed feature matrix requires more complex planning, token-level precision, and a different generative process. The agent chose the easier path of writing descriptive paragraphs comparing features, which fulfills the prompt’s topic requirement but fails its specific formatting instruction. It’s an optimization for word count and topical coverage over structural rigor.

  4. Root Cause of Superficial Depth:

• Lack of True Domain Expertise: The agent’s “knowledge” is a statistical representation of its training data. While it can retrieve and synthesize information about query languages, it cannot perform the abstract reasoning required for a true “deep dive.” It can state that DataviewJS is more powerful than a UI, but it cannot architect a complex project using both and then analyze the nuanced trade-offs from a position of genuine experience. The “depth” is therefore wide but shallow, covering all the requested topics without the penetrating insight of a true expert.

In summary, the agent’s failure is rooted in its attempt to meet a prompt that demands capabilities beyond its design—namely, empirical research, future prediction, and genuine expert analysis. Pressured to deliver a “deep” and “comprehensive” report, it resorted to its most advanced but dangerous capabilities: plausible fabrication and stylistic mimicry, ultimately producing a response that is superficially impressive but factually untrustworthy and methodologically hollow.

Axial Category Definitions

Failure to Understand Requirements (FUR).

The system fails to correctly interpret user requirements, intent, or contextual needs, focusing on superficial keyword matches rather than the actual problem, resulting in responses that do not align with the user’s goals.

Lack of Analytical Depth (LAD).

The agent fails to probe the underlying mechanisms, structural constraints, or conceptual nuances of complex problems and instead relies on surface-level logic or oversimplified frameworks, producing analyses that lack rigor and systemic coherence.

Limited Analytical Scope (LAS).

The agent exhibits a constrained cognitive scope when addressing multidimensional tasks, producing analyses that remain confined to partial dimensions or isolated elements and fail to capture holistic structures, cross-dimensional relationships, or systemic insights.

Rigid Planning Strategy (RPS).

The agent adheres to a fixed, linear execution plan without dynamically adapting its planning logic in response to output requirements, intermediate feedback, or evolving task states, thereby leading to inefficiency, error propagation, or degraded output quality.

Insufficient External Information Acquisition (IIA).

The agent fails to proactively gather the necessary external information, instead relying too heavily on internal knowledge or prior assumptions, thereby producing outputs that lack empirical grounding, exhibit incomplete coverage, or deviate from task requirements.

Information Representation Misalignment (IRM).

The agent fails to distinguish and present information appropriately based on user needs or evidence reliability, thereby weakening the relevance, credibility, and authority of the information.

Information Handling Deficiency (IHD).

The agent fails to properly extract, prioritize, or utilize critical information from available sources to fulfill detailed requirements or adapt its task approach.

Information Integration Failure (IIF).

The agent fails to maintain consistency and verifiability when handling multi-source inputs and multi-stage tasks, resulting in outputs that contain factual contradictions, logical inconsistencies, or unsubstantiated claims, and that lack effective alignment across data sources and processing standards.

Verification Mechanism Failure (VMF).

Before generating content, the system fails to perform necessary steps to verify information sources or cross-check data, resulting in outputs that do not cite required sources and lack factual grounding.
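As an illustration only, a minimal pre-generation check of the kind whose absence VMF describes might flag report paragraphs that lack any citation marker; this sketch assumes bracketed numeric citations and is not the paper's evaluation method.

```python
import re

def uncited_paragraphs(report: str) -> list[str]:
    """Flag paragraphs that make claims without any citation marker.
    Assumes citations appear as bracketed references like [12]; an
    illustrative heuristic, not the paper's verification pipeline."""
    flagged = []
    for para in report.split("\n\n"):
        if para.strip() and not re.search(r"\[\d+\]", para):
            flagged.append(para)
    return flagged
```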

Redundant Content Piling (RCP).

The agent, when lacking substantive content or effective organization, tends to pile up redundant information to fill gaps or create an illusion of thoroughness, thereby undermining the clarity and utility of its output.

Structural Organization Dysfunction (SOD).

The agent lacks holistic coordination in structuring its analysis, failing to balance coverage across key dimensions or establish meaningful connections among elements, resulting in fragmented and unsystematic outputs.

Content Specification Deviation (CSD).

The agent’s output deviates from the professional standards or user expectations required by the task in terms of language style, tone, format, or cultural context, resulting in inappropriate or ineffective responses.

Deficient Analytical Rigor (DAR).

The agent generates content without sufficient rigor, often ignoring task feasibility, omitting uncertainty disclosures, using vague or decontextualized language, lacking actionable implementation details, and presenting unverified conclusions with unwarranted confidence.

Strategic Content Fabrication (SCF).

The agent engages in strategic content fabrication by generating plausible but unfounded academic or empirical constructs—such as methods, data, or case narratives—that mimic scholarly rigor to create a false impression of credibility.