How do real search queries reveal what counts as a deep research question?
This explores whether the 'depth' of a research question is something you can read off its wording, or whether it only shows up in how real queries behave when a system tries to answer them — how much searching, hopping, and synthesis they force.
The corpus leans hard toward the second reading: depth lives in what a question demands when you try to answer it, not in its phrasing. The cleanest formal answer is that a deep research question is one that simultaneously forces multi-step information gathering, synthesis across sources, and iterative refinement of the query itself; drop any one component and you've slid back into ordinary RAG (see "What makes deep research fundamentally different from RAG?"). Notice what that definition is doing: it doesn't classify questions by topic or by how hard they feel; it classifies them by the *process* they require. Depth is operational.
That reframing matters because not all questions that look hard demand the same machinery. Work on non-factoid answering finds that questions split into five types (evidence, comparison, debate, experience, reason), and each type calls for a different retrieval and aggregation strategy, with only evidence-based questions suiting plain RAG (see "Does question type determine the right retrieval strategy?"). So the first thing real queries reveal is that 'deep' isn't one thing; it's a cluster of distinct demands, and a question announces which cluster it belongs to by how it resists a single lookup.
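The type-to-strategy split can be sketched as a small dispatch table. This is only an illustration: the five type names come from the source note, but the strategy labels and the `pick_strategy` and `needs_more_than_plain_rag` helpers are hypothetical names paraphrasing its summary, not an API from the cited work.

```python
from enum import Enum


class QuestionType(Enum):
    EVIDENCE = "evidence"
    COMPARISON = "comparison"
    DEBATE = "debate"
    EXPERIENCE = "experience"
    REASON = "reason"


# Hypothetical strategy labels, paraphrased from the source note's summary.
STRATEGY = {
    QuestionType.EVIDENCE: "standard_rag",
    QuestionType.COMPARISON: "aspect_specific_retrieval",
    QuestionType.DEBATE: "aspect_specific_retrieval",
    QuestionType.EXPERIENCE: "decompose_or_filter",
    QuestionType.REASON: "decompose_or_filter",
}


def pick_strategy(qtype: QuestionType) -> str:
    """Look up the retrieval/aggregation strategy for a question type."""
    return STRATEGY[qtype]


def needs_more_than_plain_rag(qtype: QuestionType) -> bool:
    """True for every type except the one that suits a single-pass lookup."""
    return pick_strategy(qtype) != "standard_rag"
```

Classifying a raw query into one of these types is the hard part and is deliberately left out of the sketch.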
The most striking signal, though, is quantitative. Deep research agents improve with more search steps along a curve that mirrors the reasoning-token scaling law: monotonic gains tapering into diminishing returns (see "Do search steps follow the same scaling rules as reasoning tokens?" and "Does search budget scale like reasoning tokens for answer quality?"). That gives you a working definition by behavior: a question is deep to the degree that throwing more search at it keeps paying off. A shallow query plateaus after one hop; a deep one keeps rewarding additional retrieval. Relatedly, the questions where live search agents decisively beat memorized models are exactly the ones where the answer sits outside training-data compression or past the knowledge cutoff (see "Why do search agents beat memorized retrieval on hard questions?"). Here depth shows up as the gap between what can be recalled and what must be fetched.
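One way to make "keeps paying off" concrete is to look at the marginal gain from each extra search step and find where it drops below a tolerance. The sketch below assumes you already have a quality score per search budget; the scores, the `min_gain` threshold, and the function names are illustrative assumptions, not measurements or methods from the cited notes.

```python
def marginal_gains(quality_by_steps):
    """Gain in answer quality contributed by each additional search step."""
    return [b - a for a, b in zip(quality_by_steps, quality_by_steps[1:])]


def plateau_step(quality_by_steps, min_gain=0.01):
    """Return the step count after which one more step adds less than min_gain.

    quality_by_steps[i] is a (hypothetical) quality score after i + 1 steps.
    If every extra step still pays off, the full budget length is returned.
    """
    for i, gain in enumerate(marginal_gains(quality_by_steps)):
        if gain < min_gain:
            return i + 1
    return len(quality_by_steps)


# Made-up curves purely for illustration.
shallow = [0.80, 0.80, 0.81, 0.81]        # flat after the first hop
deep = [0.20, 0.35, 0.47, 0.56, 0.62]     # monotonic, with shrinking gains

print(plateau_step(shallow), plateau_step(deep))
```

On these toy curves the shallow query plateaus at one step while the deep one is still improving at the end of its budget, which is the behavioral signature described above.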
You can even see researchers reverse-engineering this. To *manufacture* genuinely deep questions for training, one approach walks random paths through a knowledge graph and deliberately blurs the entities, producing verifiable multi-hop questions that can't be shortcut (see "Can knowledge graphs generate training data for search agents?"). The recipe for a hard question is essentially the depth definition run in reverse: force multiple hops, deny a direct match. And the dark mirror confirms the pressure is real: when agents face a depth demand they can't actually meet, a large share of their failures (39% in one analysis of 1,000 failure reports) come from fabricating examples and evidence to *mimic* rigor (see "Why do deep research agents fabricate scholarly content?"). Depth-demand is concrete enough that systems will counterfeit it under stress.
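The random-walk-plus-blurring recipe can be sketched in a few lines. This is a toy under stated assumptions, not the cited pipeline: the graph, the `BLURRED` descriptions, and the `random_walk` and `make_question` helpers are all hypothetical, and a real system would render the relation chain as natural language and obscure more than the starting entity.

```python
import random

# Toy knowledge graph: entity -> list of (relation, neighbour). Illustrative only.
KG = {
    "Ada": [("born_in", "London")],
    "London": [("capital_of", "UK")],
    "UK": [("member_of", "NATO")],
    "NATO": [],
}

# Hypothetical vague descriptions that replace a named entity in the question.
BLURRED = {"Ada": "a 19th-century mathematician"}


def random_walk(graph, start, hops, rng):
    """Follow up to `hops` randomly chosen edges, returning the triples visited."""
    path, node = [], start
    for _ in range(hops):
        edges = graph.get(node, [])
        if not edges:
            break
        relation, nxt = rng.choice(edges)
        path.append((node, relation, nxt))
        node = nxt
    return path


def make_question(path, blurred):
    """Build a multi-hop question whose start entity is obscured.

    The answer is the final node on the path, so it can be checked exactly.
    """
    if not path:
        raise ValueError("need at least one hop to form a question")
    start = path[0][0]
    description = blurred.get(start, start)
    relations = " -> ".join(rel for _, rel, _ in path)
    question = f"Starting from {description}, follow: {relations}. What do you reach?"
    return question, path[-1][2]


path = random_walk(KG, "Ada", hops=3, rng=random.Random(0))
question, answer = make_question(path, BLURRED)
```

Because the start entity is never named, a single keyword lookup cannot resolve the question, while the stored final node keeps the answer verifiable.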
The quietly subversive takeaway: the cues humans use to judge depth are unreliable. Users rate answers as better simply when they carry more citations, whether or not those citations are relevant (see "Do users trust citations more when there are simply more of them?"). So 'looks deep' and 'is deep' come apart: readers track a surface heuristic while the real signature of a deep question is structural, showing up in things like whether it needs query planning split from answer synthesis (see "Do hierarchical retrieval architectures outperform flat ones on complex queries?") or whether a model's own uncertainty says one retrieval won't suffice (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). Real queries reveal what counts as deep not by how they sound but by how many hops they cost, how badly memory fails them, and how much synthesis they refuse to collapse into a single answer.
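The uncertainty signal mentioned above can be illustrated with a minimal gate that decides whether to retrieve based on the model's own token probabilities. This is a sketch under assumptions: it uses the geometric mean of token probabilities with an arbitrary threshold, whereas the source note refers to calibrated uncertainty, and the calibration step is not shown. The function names and the threshold value are hypothetical.

```python
import math


def mean_token_logprob(token_probs):
    """Average log-probability of the tokens in a draft answer."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def should_retrieve(token_probs, threshold=0.6):
    """Retrieve when the geometric-mean token probability is below threshold.

    token_probs are per-token probabilities from a draft answer; the 0.6
    threshold is an arbitrary illustrative value, not a recommended setting.
    """
    confidence = math.exp(mean_token_logprob(token_probs))
    return confidence < threshold


confident_draft = [0.90, 0.95, 0.92]   # made-up probabilities
uncertain_draft = [0.40, 0.50, 0.30]   # made-up probabilities
```

With these made-up inputs the confident draft skips retrieval and the uncertain one triggers it, which is the sense in which the model's own uncertainty marks a question as needing more than recall.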
Sources (10 notes)
The Characterizing Deep Research paper establishes that genuine deep research must combine multi-step information gathering, cross-source synthesis, and iterative query refinement operating together. Systems lacking any component—such as those skipping iterative refinement—fall short of the definition and show different scaling behavior.
Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.
Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
DeepResearcher agents trained on live web search beat static knowledge models on knowledge-intensive tasks. The mechanism is not better reasoning but retrieval: real-time search avoids temporal bounds and probabilistic compression that plague training-data memorization.
KG-based random walks with selective entity obscuring create verifiable, multi-hop questions that train deep search agents effectively. DeepDive-32B trained on this data achieves 14.8% on BrowseComp, outperforming larger models through end-to-end multi-turn RL.
Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.
Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.