Why do question types determine retrieval and decomposition strategy in QA?
This explores why the kind of question being asked, not just its topic, dictates how a QA system should fetch evidence and break the problem apart. The most direct answer in the collection comes from work showing that non-factoid questions split into five types, and that each type calls for a different retrieval and aggregation recipe (see: Does question type determine the right retrieval strategy?). An evidence-seeking question is well served by ordinary RAG: find the passage, return it. But a comparison or debate question needs aspect-specific retrieval (you have to gather each side), and an experience or reasoning question needs to be decomposed into sub-questions or filtered before retrieval even makes sense. The question type, in other words, encodes the shape of the answer, and retrieval has to match that shape.
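The routing idea above can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the five type labels come from the note summary, while `QuestionType`, `RetrievalPlan`, `plan_retrieval`, and the `parts` argument (aspects or sub-questions supplied by some upstream step) are hypothetical names, not the cited work's API. The sketch only plans queries; it does not call a retriever.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class QuestionType(Enum):
    # Labels follow the note summary; the enum itself is illustrative.
    EVIDENCE = "evidence"
    COMPARISON = "comparison"
    DEBATE = "debate"
    EXPERIENCE = "experience"
    REASON = "reason"


@dataclass
class RetrievalPlan:
    strategy: str
    queries: List[str] = field(default_factory=list)


def plan_retrieval(
    question: str,
    qtype: QuestionType,
    parts: Optional[List[str]] = None,
) -> RetrievalPlan:
    """Pick a retrieval strategy from the question type (hypothetical helper)."""
    parts = parts or []
    # Evidence questions, or anything with no aspects/sub-questions available,
    # fall back to one ordinary retrieval pass over the original question.
    if qtype is QuestionType.EVIDENCE or not parts:
        return RetrievalPlan("single_pass", [question])
    # Comparison and debate: one query per side or aspect.
    if qtype in (QuestionType.COMPARISON, QuestionType.DEBATE):
        return RetrievalPlan(
            "per_aspect", [f"{question} (aspect: {p})" for p in parts]
        )
    # Experience and reason: retrieve for each sub-question instead.
    return RetrievalPlan("decompose_then_retrieve", list(parts))
```

In a real system the classifier that produces `qtype` and the step that proposes `parts` would carry most of the difficulty; the dispatcher only makes the conditional structure explicit.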
What makes this a broad pattern rather than a single finding is that several other notes converge on the same underlying claim from different angles: the right strategy is conditional, and a system that applies one fixed pipeline to every question pays for it. DeepRAG frames each reasoning step as a decision about whether to retrieve at all or lean on the model's internal knowledge, and reports a roughly 22% accuracy improvement from switching strategy per step instead of retrieving uniformly (see: When should language models retrieve external knowledge versus use internal knowledge?). A simpler line of work reaches a parallel conclusion: calibrated uncertainty estimates can decide *when* retrieval is worth the cost, beating heavier adaptive schemes on single-hop tasks and matching them on multi-hop tasks with a fraction of the model and retriever calls (see: Can simple uncertainty estimates beat complex adaptive retrieval?). Both say the same thing the question-type work says: the trigger for retrieval is a property of the question, not a constant.
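A minimal sketch of an uncertainty gate, assuming the model exposes per-token probabilities for a draft answer. The function names, the use of mean negative log-probability as the uncertainty score, and the threshold value are all assumptions made for illustration; the cited work's exact estimator and calibration procedure are not reproduced here.

```python
import math
from typing import Sequence


def mean_token_surprisal(token_probs: Sequence[float]) -> float:
    """Average negative log-probability of the generated tokens."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each probability must be in (0, 1]")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def should_retrieve(token_probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Retrieve only when the draft answer looks uncertain.

    The threshold is a placeholder; in practice it would be tuned on
    held-out data so that the score is reasonably calibrated.
    """
    return mean_token_surprisal(token_probs) > threshold


confident_draft = [0.90, 0.95, 0.99]   # model is sure of each token
hesitant_draft = [0.30, 0.40, 0.20]    # model is guessing

print(should_retrieve(confident_draft))
print(should_retrieve(hesitant_draft))
```

The appeal of this kind of gate is cost: it needs one forward pass and no extra retriever calls to decide whether retrieval is worth doing at all.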
Decomposition shows the same conditionality. Multi-hop and complex queries benefit from separating query planning from answer synthesis into distinct stages, which reduces interference and outperforms flat pipelines (see: Do hierarchical retrieval architectures outperform flat ones on complex queries?). And the unit of retrieval itself should bend to the question: how-to and procedural questions are badly served by fixed-size chunks, which sever the step-to-step dependencies, so 'logic units' that preserve prerequisites and links between steps work better for that question type (see: How do logic units preserve procedural coherence better than chunks?). A simple factual lookup rarely needs that machinery; a multi-step procedure usually does.
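To make the contrast with fixed-size chunks concrete, here is a hedged sketch of a logic unit as a data structure. The four fields (prerequisite, header, body, linker) follow the note summary; the class name, the `walk_procedure` helper, and the rule of always following the first linker are simplifying assumptions, not the THREAD implementation, which chooses among linkers based on the query.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LogicUnit:
    header: str                      # what this step is about
    body: str                        # the instructions for the step
    prerequisite: str = ""           # what must hold before the step applies
    linkers: List[str] = field(default_factory=list)  # headers of next steps


def walk_procedure(
    units: Dict[str, LogicUnit], start: str, max_steps: int = 20
) -> List[str]:
    """Follow linkers from a starting unit and return headers in order.

    Simplification: always takes the first linker. Stops on a missing
    unit, a repeated unit (cycle), or after max_steps.
    """
    ordered: List[str] = []
    seen = set()
    current: Optional[str] = start
    while current is not None and current in units and current not in seen:
        if len(ordered) >= max_steps:
            break
        seen.add(current)
        ordered.append(current)
        next_links = units[current].linkers
        current = next_links[0] if next_links else None
    return ordered


units = {
    "install": LogicUnit("install", "Install the package.", linkers=["configure"]),
    "configure": LogicUnit(
        "configure", "Edit the config file.",
        prerequisite="package installed", linkers=["run"],
    ),
    "run": LogicUnit("run", "Start the service.", prerequisite="config present"),
}

print(walk_procedure(units, "install"))
```

A fixed-size chunker could cut between "configure" and "run" and return either alone; here the link between steps is part of the retrieved unit, so the dependency survives retrieval.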
There's a quieter insight worth pulling out: format and framing shape strategy more than content does. One study found that the *format* a model was trained on (multiple-choice vs. free-form) shaped its reasoning style (breadth-first vs. depth-first) about 7.5 times more strongly than the subject domain did (see: Does training data format shape reasoning strategy more than domain?). That's the same principle running underneath the whole question: the structural type of a problem governs how it should be attacked, and topic is secondary. If you want to go further into why this matters upstream, the work on training models to ask good clarifying questions shows that even *recognizing* what type of question is at hand (its clarity, its specificity, what's missing) is itself a skill that has to be learned and decomposed (see: Can models learn to ask genuinely useful clarifying questions?).
The thing you didn't know you wanted to know: 'one good RAG pipeline' is a category error. The field is quietly converging on the idea that QA isn't a single task but a family of tasks wearing the same costume, and the first real move in answering well is classifying which one you're holding.
Sources (7 notes)
Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
THREAD replaces chunks with four-part logic units—prerequisite, header, body, linker—enabling dynamic multi-step retrieval for how-to questions. Linkers explicitly navigate between steps and branches, addressing both the semantic-vs-task-relevance gap in embeddings and the sequential dependency loss in chunk-based RAG.
Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.