What are the 27 external features that predict retrieval need?
This explores a specific finding — that 27 cheap, surface-level features of a question alone can predict whether a system needs to go fetch outside information — and places it inside the larger corpus debate over how systems should decide when to retrieve at all.
The 27 features are "external" in the sense that you can read them off the question itself, before the model even tries to answer. The corpus doesn't hand you a labeled list of all 27, but it tells you what kind of features they are and why they matter. They're called *external* and *lightweight* precisely because they don't require running the model and measuring its confidence. Think question length, entity counts, question type (who/what/when), presence of rare or named terms, syntactic complexity: cheap signals computed from the text. The striking claim in "Can question features alone predict when to retrieve?" is that a learned predictor over these 27 features *matches* far more expensive uncertainty-estimation methods on overall performance across six QA datasets, and actually *beats* them on complex questions, at a fraction of the cost.
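To make "cheap signals computed from the text" concrete, here is a minimal sketch of what a feature extractor and retrieval gate of this kind could look like. Everything in it is an assumption for illustration: the five features are stand-ins for the kinds of signals described above, not the actual 27 (which the corpus does not enumerate), and the names `extract_features`, `retrieval_probability`, `should_retrieve`, and the hand-set weights are hypothetical. A real predictor would learn its weights from labeled data rather than hard-code them.

```python
import math
import re
from dataclasses import dataclass

# Illustrative only: these are NOT the 27 features from the cited note.
WH_WORDS = ("who", "what", "when", "where", "why", "which", "how")
CLAUSE_MARKERS = ("that", "which", "whose", "and", "or", "while", "because")

# Hypothetical hand-set weights; a real system would fit these on labeled data.
HYPOTHETICAL_WEIGHTS = {
    "n_tokens": 0.05,
    "n_capitalized": 0.6,
    "n_numeric": 0.4,
    "n_clause_markers": 0.5,
}
HYPOTHETICAL_BIAS = -1.5


@dataclass
class QuestionFeatures:
    n_tokens: int          # question length
    n_capitalized: int     # crude proxy for named entities
    n_numeric: int         # tokens containing a digit (dates, quantities)
    wh_word: str           # question type
    n_clause_markers: int  # crude proxy for syntactic complexity


def extract_features(question: str) -> QuestionFeatures:
    """Compute surface features from the question text alone (no model call)."""
    tokens = re.findall(r"[A-Za-z0-9']+", question)
    lowered = [t.lower() for t in tokens]
    return QuestionFeatures(
        n_tokens=len(tokens),
        # Skip the first token: a sentence-initial capital says nothing.
        n_capitalized=sum(1 for t in tokens[1:] if t[0].isupper()),
        n_numeric=sum(1 for t in tokens if any(c.isdigit() for c in t)),
        wh_word=next((t for t in lowered if t in WH_WORDS), "other"),
        n_clause_markers=sum(1 for t in lowered if t in CLAUSE_MARKERS)
        + question.count(","),
    )


def retrieval_probability(f: QuestionFeatures) -> float:
    """Logistic score over the numeric features.

    wh_word is categorical; a learned predictor would one-hot encode it.
    It is left out of this toy score.
    """
    score = HYPOTHETICAL_BIAS
    score += HYPOTHETICAL_WEIGHTS["n_tokens"] * f.n_tokens
    score += HYPOTHETICAL_WEIGHTS["n_capitalized"] * f.n_capitalized
    score += HYPOTHETICAL_WEIGHTS["n_numeric"] * f.n_numeric
    score += HYPOTHETICAL_WEIGHTS["n_clause_markers"] * f.n_clause_markers
    return 1.0 / (1.0 + math.exp(-score))


def should_retrieve(question: str, threshold: float = 0.5) -> bool:
    """Gate retrieval on the question's surface features only."""
    return retrieval_probability(extract_features(question)) >= threshold
```

The point of the sketch is the cost profile rather than the particular features: the gate runs a regex and a dot product, and never touches the language model.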
Why does this matter beyond saving compute? Because it reframes the problem of deciding when to retrieve. The dominant alternative is to let the model introspect: generate an answer, measure how uncertain it is, and retrieve only when it's shaky. "When should language models retrieve external knowledge versus use internal knowledge?" takes the introspective route to its logical end, treating each reasoning step as a decision about whether to lean on internal knowledge or reach for external knowledge, and reports a ~22% accuracy gain, mostly from *not* retrieving when retrieval would just add noise. The external-features result says you can get much of that selectivity without the expensive self-interrogation, because the question's surface already leaks whether it's the kind of thing the model knows.
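For contrast, the introspective route can be sketched as an uncertainty gate. This is a generic illustration, not the method of any cited note: it assumes you have already run the model once and collected a probability distribution for each generated token, and it uses mean token entropy as the uncertainty measure. The function names and the threshold value are hypothetical.

```python
import math
from typing import Sequence


def mean_token_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    if not distributions:
        raise ValueError("need at least one token distribution")
    total = 0.0
    for dist in distributions:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(distributions)


def retrieve_if_uncertain(
    distributions: Sequence[Sequence[float]], threshold: float = 0.5
) -> bool:
    """Trigger retrieval only when the draft answer looks shaky.

    The threshold is an illustrative value; in practice it would be tuned.
    """
    return mean_token_entropy(distributions) > threshold
```

Note where the cost sits: the distributions only exist after a generation pass, so this gate pays for a draft answer before it can decide anything.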
The deeper reason this is interesting is that "when to retrieve" turns out to be one of the load-bearing failure points of RAG, not a tuning detail. "Where do retrieval systems fail and why?" names retrieval triggering as one of three *structural* failure levels: fixed-interval retrieval wastes context by fetching when nothing is needed, and starves the model when something is. "How should systems retrieve and reason with external knowledge?" echoes that retrieval should adapt dynamically and couple tightly with reasoning rather than fire on a schedule. So the 27 features aren't a niche trick; they're one cheap answer to a problem the corpus treats as architectural.
There's a lateral tension worth seeing. The external-features approach keeps the retrieval decision with a separate gatekeeper that judges from the outside. Other lines in the corpus argue the system itself should make the call: "Can models decide better than retrievers which tools to use?" shows that models emitting their own structured requests for tools beat a passive retriever guessing, and "Can retrieval learn what actually helps answer questions?" pushes the decision inward by training the retriever on whether retrieved documents actually improved the answer. Read together, the question "what features predict retrieval need?" sits between two philosophies: predict it cheaply from the outside, or let the system learn it from the inside. The surprising takeaway is that the cheap outside view holds its own, and is hardest to beat exactly where it counts: on the complex questions.
Sources (6 notes)
Learned predictors using 27 lightweight external question features match complex uncertainty-based methods on overall performance while costing far less, and outperform them on complex questions across 6 QA datasets.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.
Research shows retrieval should adapt dynamically rather than follow fixed patterns, reasoning and retrieval must integrate closely, and embedding-based retrieval has fundamental limits requiring architectural alternatives.
MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.
CLaRa propagates generator loss back through continuous document representations, allowing retrievers to optimize for documents that actually improve answers rather than surface similarity. The gap between relevance and usefulness closes when retrieval receives direct feedback from generation success.