Can smaller models handle RAG filtering while larger models focus on synthesis?
Does splitting a RAG pipeline's work between cheaper small models and expensive large models improve both cost and quality? The underlying question is whether different pipeline stages have different optimal model sizes.
HiFi-RAG separates the RAG pipeline into stages handled by models of different capability and cost: a fast, cheap model (Gemini 2.5 Flash) reformulates the query, prunes irrelevant retrieved passages, and attaches citations, while the large, expensive model (Gemini 2.5 Pro) is invoked only at the final generation step. This is a tiering pattern with a specific theoretical justification: filtering and citation are pattern-matching tasks for which the smaller model is sufficient, while final synthesis is where the large model's reasoning matters most.
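A minimal sketch of the tiering pattern, assuming a generic `complete(model, prompt)` chat-completion helper that you would back with a real API client. The model names follow the note, but the prompts, the YES/NO pruning loop, and the mechanical citation markers are illustrative assumptions, not the HiFi-RAG implementation:

```python
# Illustrative sketch of the tiering pattern, not HiFi-RAG's actual code.
# `complete(model, prompt)` is a stand-in for any chat-completion client.

SMALL = "gemini-2.5-flash"  # cheap tier: reformulation, pruning, citation
LARGE = "gemini-2.5-pro"    # expensive tier: final synthesis only

def complete(model: str, prompt: str) -> str:
    """Placeholder: wire this to a real LLM API before running."""
    raise NotImplementedError

def answer(question: str, retrieve) -> str:
    # Stage 1 (small model): reformulate the question into a search query.
    query = complete(SMALL, f"Rewrite as a search query: {question}")

    passages = retrieve(query)  # any retriever: BM25, dense, hybrid

    # Stage 2 (small model): prune passages that don't bear on the question.
    kept = [
        p for p in passages
        if complete(
            SMALL,
            f"Question: {question}\nPassage: {p}\n"
            "Answer YES if the passage helps answer the question, else NO.",
        ).strip().upper().startswith("YES")
    ]

    # Stage 3 (small model's job in the note; shown mechanically here):
    # attach citation markers to the surviving passages.
    cited = [f"[{i + 1}] {p}" for i, p in enumerate(kept)]

    # Stage 4 (large model, the only expensive call): synthesize over a
    # smaller, higher-quality context than raw retrieval would provide.
    context = "\n\n".join(cited)
    return complete(
        LARGE,
        "Using only the cited passages, answer the question.\n\n"
        f"{context}\n\nQuestion: {question}",
    )
```

Note how the large model appears exactly once, at the end, after the context has already been shrunk and labeled by the cheap tier.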
The design implies a richer view of "RAG" than a single retrieve-then-generate pass. Each intermediate decision (which query to expand, which passages to keep, which spans to cite) has its own optimal cost-quality point, and forcing the most capable model to perform all of them wastes compute on tasks where it offers no marginal benefit. The hierarchy also produces a useful side effect: because filtering happens before generation, the large model receives a smaller, higher-quality context, which improves its answer quality even setting cost aside.
The general principle is that RAG architectures should think in terms of decision granularity rather than uniform model deployment. The retrieval pipeline contains several distinct sub-decisions, and matching each to an appropriately sized model produces both cheaper and better answers: a Pareto improvement that uniform RAG misses because it treats retrieval and generation as a single coupled act. This is the RAG-specific instance of "Can small language models handle most agent tasks?": heterogeneous tiered architectures are the economic imperative whenever subtasks have different capability requirements.
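A back-of-the-envelope cost model makes the Pareto claim concrete. The per-1K-token prices and token counts below are placeholder assumptions, not real Gemini pricing; substitute your own numbers:

```python
# Hypothetical per-1K-token prices; replace with real pricing for your models.
PRICE = {"small": 0.0003, "large": 0.005}

def pipeline_cost(tokens: dict[str, int]) -> float:
    """Cost of one query given how many tokens each tier processes."""
    return sum(PRICE[tier] * n / 1000 for tier, n in tokens.items())

# Uniform RAG: the large model reads all retrieved passages (say 8K tokens).
uniform = pipeline_cost({"large": 8_000})

# Tiered RAG: the small model reads the full 8K to prune; the large model
# sees only the ~2K tokens that survive filtering.
tiered = pipeline_cost({"small": 8_000, "large": 2_000})

print(f"uniform: ${uniform:.4f}  tiered: ${tiered:.4f}")
# With these placeholder numbers: uniform $0.0400 vs tiered $0.0124,
# while the large model also gets a cleaner context.
```

Under these assumptions the tiered pipeline is roughly 3x cheaper per query, and the quality side of the Pareto move comes for free from the pruned context.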
Source: 12 types of RAG
Related concepts in this collection
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  extends: same heterogeneous-architecture economic argument applied to RAG sub-decisions; filtering and citation are the SLM-suitable subtasks, final synthesis is the LLM-required subtask
- Do hierarchical retrieval architectures outperform flat ones on complex queries?
  Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.
  extends: structurally analogous; HierSearch separates planning from synthesis at the system level, while HiFi-RAG separates filtering from synthesis at the model-tier level
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute-allocation strategies.
  extends: HiFi-RAG inverts the usual move; instead of spending more compute on a smaller model, it allocates the larger model only to the genuinely hard step. Both treat compute allocation, rather than uniform scaling, as the lever.
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  extends: same adaptive-allocation principle, applied to model selection across pipeline stages rather than to compute per query
Original note title: hierarchical RAG splits filtering from generation across model tiers — small models prune and cite while large models only synthesize the final answer