What makes deep research fundamentally different from RAG?
Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.
"Deep research" is used loosely to describe anything from a single web search to a multi-hour autonomous investigation. The Characterizing Deep Research paper proposes a formal three-component definition that makes the boundary precise:
- Multi-step information gathering — not one retrieval round but a sequence of them, where each round can expand or contract the search space
- Cross-source synthesis — combining findings from multiple independent sources, not just summarizing one document
- Iterative query refinement — using partial findings to improve subsequent queries, not issuing all queries upfront
The definition excludes single-step RAG (fails component 1), document summarization (fails component 3), and simple web browsing (which may fail component 2). Only systems that exercise all three components together, in a loop, qualify.
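The three components can be sketched as a single loop. This is a toy illustration, not the paper's algorithm: `search`, `refine`, and the pipe-joined "synthesis" are stand-ins we invented to show where each component sits in the control flow.

```python
def search(query, corpus):
    """Stub retrieval: return documents sharing at least one term with the query."""
    terms = set(query.lower().split())
    return [doc for doc in corpus if terms & set(doc.lower().split())]

def refine(query, findings):
    """Stub refinement: extend the query with a new term drawn from partial findings."""
    for doc in findings:
        for term in doc.lower().split():
            if term not in query.lower():
                return f"{query} {term}"
    return query

def deep_research(question, corpus, max_rounds=3):
    query, sources = question, []
    for _ in range(max_rounds):          # component 1: multi-step gathering
        hits = search(query, corpus)
        sources.extend(h for h in hits if h not in sources)
        query = refine(query, hits)      # component 3: iterative query refinement
    # component 2: cross-source synthesis (here just concatenation of sources)
    return " | ".join(sources)
```

The point of the sketch is the exclusion logic above: drop the loop and you have single-step RAG; drop `refine` and every round re-issues the same query; drop the final join and nothing combines the independent sources.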
The practical value of the definition is benchmarking clarity. Without it, systems that perform single-step retrieval with sophisticated synthesis can claim "deep research" capability when they lack the iterative refinement component that actually distinguishes DR from RAG++. PRELUDE (the benchmark that accompanies the paper) evaluates all three components, making it possible to locate exactly where a system falls short.
This also clarifies what the test-time scaling (TTS) law applies to: the question "Does search budget scale like reasoning tokens for answer quality?" concerns a scaling law specifically for systems that meet the full three-component definition. Partial systems that skip iterative query refinement likely exhibit different scaling behavior.
Researchy Questions (2024) operationalizes the "unknown unknowns" concept for deep research. Unlike standard QA benchmarks that study "known unknowns" with clear indications of what information is missing, Researchy Questions identifies non-factoid, multi-perspective, decompositional questions from real search engine logs — questions where the questioner doesn't know what they don't know. Users spend significantly more effort (clicks, session length) on these queries, and "slow thinking" techniques like decomposition into sub-questions show benefit over direct answering. An 8-dimension quality rubric (ambiguity, incompleteness, assumptions, multi-facetedness, knowledge-intensity, subjectivity, reasoning-intensity, harmfulness) provides granular characterization. This distinguishes "deep" questions from merely "hard" ones: a deep question has multiple perspectives allowing a dense manifold of answers, no single correct answer, and requires genuine synthesis rather than just retrieval. Source: Arxiv/Agentic Research.
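The 8-dimension rubric is easiest to see as a record type. The field names come from the source; the `is_deep` heuristic is our assumption, added only to make concrete the note's distinction that depth hinges on multi-facetedness and synthesis rather than difficulty alone — it is not the benchmark's scoring rule.

```python
from dataclasses import dataclass

@dataclass
class RubricScores:
    """The 8 quality dimensions from Researchy Questions, each scored 0..1."""
    ambiguity: float
    incompleteness: float
    assumptions: float
    multi_facetedness: float
    knowledge_intensity: float
    subjectivity: float
    reasoning_intensity: float
    harmfulness: float

def is_deep(scores: RubricScores, threshold: float = 0.5) -> bool:
    """Toy heuristic (ours): 'deep' = multi-perspective AND synthesis-heavy."""
    return (scores.multi_facetedness > threshold
            and scores.reasoning_intensity > threshold)
```

A question can score high on knowledge-intensity (hard) while scoring low on multi-facetedness (not deep), which is exactly the hard-versus-deep split the note draws.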
Source: Deep Research
Related concepts in this collection
- **Does search budget scale like reasoning tokens for answer quality?** Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching. (grounds: the TTS law applies specifically to systems meeting this formal definition; the three components define what search budget measures)
- **Do hierarchical retrieval architectures outperform flat ones on complex queries?** Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches. (connects: hierarchical architecture is the structural implementation of the three-component definition)
Original note title
deep research requires a formal three-component definition: multi-step information gathering, cross-source synthesis, and iterative query refinement