Can document count be learned instead of fixed in RAG?
Standard RAG systems pass a fixed number of documents to the generator regardless of query complexity. Can an RL agent instead learn to select both how many documents to pass and in what order, based on what actually helps the generator produce correct answers?
Every standard RAG re-ranking system passes a fixed number k of documents to the generator. That k is set by the system designer and held constant across queries, which is wrong in both directions: too few documents omit information critical to complex queries; too many introduce noise that misleads the generator and wastes compute.
Pre-DynamicRAG re-ranking approaches leave the k-selection problem untouched: they improved document ordering but assumed k was given, treating the number of documents as a tuning hyperparameter rather than a learned, per-query decision.
DynamicRAG models the reranker as an RL agent whose action selects both a permutation of the retrieved documents and a cutoff count k. The reward is LLM output quality: specifically, whether the generator produces a correct answer given the selected document set. The agent thus receives both explicit query signals and feedback from the generator itself.
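A minimal sketch of that action-and-reward interface, in Python. Everything here is an illustrative stand-in rather than DynamicRAG's actual code: `generate_answer` plays the role of the frozen generator LLM, and `exact_match` stands in for the answer-quality metric.

```python
from dataclasses import dataclass

@dataclass
class RerankAction:
    order: list[int]  # permutation over retrieved candidate indices
    k: int            # how many of the top-ordered documents to keep

def generate_answer(query: str, docs: list[str]) -> str:
    """Stand-in for the frozen generator LLM."""
    return f"answer to {query!r} from {len(docs)} docs"

def exact_match(pred: str, gold: str) -> bool:
    """Stand-in quality metric; a real system might use EM, F1, or an LLM judge."""
    return pred.strip().lower() == gold.strip().lower()

def reward(query: str, docs: list[str], action: RerankAction, gold: str) -> float:
    """The reranker is rewarded only when its selected, ordered subset
    lets the generator produce a correct answer."""
    selected = [docs[i] for i in action.order[: action.k]]
    return float(exact_match(generate_answer(query, selected), gold))
```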
Training proceeds in two phases. First, behavior cloning on expert trajectories (SFT) gives the reranker a baseline policy and shrinks the effective action space it must explore. Second, RL with generator feedback lets the reranker explore and learn to calibrate both ordering and count to each query's needs.
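A hedged sketch of that two-phase recipe, reusing the `reward` function above. The `policy.sample` and `policy.log_prob` interfaces are assumptions (any policy network exposing a differentiable action log-likelihood would do), and the phase-2 update shown is a plain REINFORCE-style gradient, not necessarily the paper's exact objective.

```python
def train(policy, optimizer, expert_trajectories, rl_examples):
    # Phase 1: behavior cloning (SFT) on expert (state, action) pairs.
    # This gives the reranker a sane baseline policy before exploration
    # and shrinks the effective action space RL has to search.
    for state, expert_action in expert_trajectories:
        loss = -policy.log_prob(expert_action, state)  # maximize expert likelihood
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Phase 2: RL with generator feedback. The sampled action fixes both
    # the ordering and the count k; the generator's answer quality is the reward.
    for query, docs, gold in rl_examples:
        action = policy.sample(query, docs)    # -> RerankAction(order, k)
        r = reward(query, docs, action, gold)  # generator answer quality
        loss = -r * policy.log_prob(action, (query, docs))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```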
The insight generalizes beyond re-ranking: any RAG system parameter that is currently set by heuristic (chunk size, retrieval depth, context window allocation) is a candidate for learning via generator feedback. The generator's output quality is a reward signal that can be propagated to any component of the pipeline that affects what the generator receives.
Source: RAG
Related concepts in this collection
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  Relation: the same adaptive-allocation principle applied to document selection; the optimal k depends on the query, not on system configuration.

- Can retrieval learn what actually helps answer questions?
  Standard RAG trains retrievers to find similar documents and generators to produce answers separately. But does surface similarity match what genuinely helps generate correct responses? This explores whether retrieval can receive feedback from answer quality.
  Relation: CLaRa addresses the same generator-feedback problem via continuous representations; DynamicRAG addresses it via RL.

- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  Relation: the same RL mechanism at a different level. RL prunes wrong reasoning paths in domain contexts; DynamicRAG prunes wrong document selections. In both cases RL refines an existing process by suppressing suboptimal choices rather than adding new capability.

- Does supervising retrieval steps outperform final answer rewards?
  Can intermediate feedback on retrieval decisions (which documents to fetch, when to stop) train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed, and good ones can fail on noisy metrics.
  Relation: complementary RL approaches to RAG. DynamicRAG learns document count and order via RL; RAG-Gym learns intermediate retrieval-step quality via process supervision. Together they show RL can optimize both the what-to-include and how-to-retrieve aspects of RAG.

- Can rationale-driven selection beat similarity re-ranking for evidence?
  Can LLMs generate search guidance that outperforms traditional similarity-based evidence ranking? This matters because current re-ranking lacks interpretability and fails against adversarial attacks.
  Relation: both attack fixed k, via different mechanisms. DynamicRAG learns adaptive k through RL with generator feedback; METEORA eliminates k via rationale-match elbow detection. DynamicRAG is training-time optimization, METEORA is inference-time architecture.
Original note title: RL-trained reranker that adjusts document order and count solves the fixed top-k problem in RAG