Knowledge Retrieval and RAG

Can smaller models handle RAG filtering while larger models focus on synthesis?

Does splitting the work of a RAG pipeline between cheap small models and expensive large models improve both cost and quality? The underlying question is whether different pipeline stages have different optimal model sizes.

Note · 2026-05-03 · sourced from 12 types of RAG

HiFi-RAG separates the RAG pipeline into stages handled by models of different capability and cost: a fast, cheap model (Gemini 2.5 Flash) reformulates the query, prunes irrelevant retrieved passages, and attaches citations, while the large, expensive model (Gemini 2.5 Pro) is invoked only at the final generation step. This is a tiering pattern with a specific rationale: filtering and citation are pattern-matching tasks for which the smaller model is sufficient, while final synthesis is where the large model's reasoning matters most.
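The staging described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not HiFi-RAG's actual implementation: `tiered_rag`, `Passage`, the prompt wording, and the `retrieve`, `small`, and `large` callables are all hypothetical stand-ins, and citation is simplified to tagging each kept passage with its document id rather than having the small model select spans.

```python
from dataclasses import dataclass
from typing import Callable, List

# A "model" here is any function from prompt text to completion text.
# Hypothetical stand-in for whatever client wraps the small or large model.
Model = Callable[[str], str]


@dataclass
class Passage:
    doc_id: str
    text: str


def tiered_rag(
    question: str,
    retrieve: Callable[[str], List[Passage]],
    small: Model,
    large: Model,
) -> str:
    # Stage 1 (small model): reformulate the user question into a search query.
    query = small(f"Rewrite as a search query: {question}").strip()

    # Stage 2 (no model): fetch candidate passages for the reformulated query.
    candidates = retrieve(query)

    # Stage 3 (small model): keep only passages judged relevant.
    kept = []
    for passage in candidates:
        verdict = small(
            f"Question: {question}\nPassage: {passage.text}\nRelevant? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            kept.append(passage)

    # Stage 4 (simplified citation): tag each kept passage with its id so the
    # final answer can cite it. A fuller version would ask the small model
    # to pick the supporting spans.
    context = "\n".join(f"[{p.doc_id}] {p.text}" for p in kept)

    # Stage 5 (large model): the only call to the expensive model.
    return large(
        f"Answer using only the cited passages.\n{context}\nQuestion: {question}"
    )
```

Passing the models in as plain callables keeps the sketch independent of any particular SDK and makes the single large-model call easy to verify with stubs.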

The design implies a richer view of "RAG" than a single retrieve-then-generate pass. Each intermediate decision (which query to expand, which passages to keep, which spans to cite) has its own optimal cost-quality point, and forcing the most capable model to make all of them wastes compute on tasks where it offers no marginal benefit. The hierarchy also produces a useful side effect: because filtering happens before generation, the large model receives a smaller, higher-quality context, which improves its answer quality even setting cost aside.
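A rough back-of-the-envelope calculation shows where the savings come from. Every number below is an illustrative assumption (passage counts, token sizes, and per-million-token prices are not taken from the source or from any real price list), and the model ignores output tokens and prompt overhead.

```python
def pipeline_cost(
    small_tokens: int,
    large_tokens: int,
    small_price_per_m: float,
    large_price_per_m: float,
) -> float:
    """Input-token cost in dollars, given prices per million tokens."""
    return (
        small_tokens * small_price_per_m + large_tokens * large_price_per_m
    ) / 1_000_000


# Hypothetical workload: 20 retrieved passages of 500 tokens each,
# of which the filter keeps 4.
RETRIEVED_TOKENS = 20 * 500
KEPT_TOKENS = 4 * 500

# Hypothetical prices: the large model costs 10x the small one per token.
SMALL_PRICE = 0.5
LARGE_PRICE = 5.0

# Uniform: the large model reads every retrieved passage.
uniform = pipeline_cost(0, RETRIEVED_TOKENS, SMALL_PRICE, LARGE_PRICE)

# Tiered: the small model reads everything, the large model only what survives.
tiered = pipeline_cost(RETRIEVED_TOKENS, KEPT_TOKENS, SMALL_PRICE, LARGE_PRICE)

print(f"uniform: ${uniform:.3f} per query, tiered: ${tiered:.3f} per query")
```

Under these assumed numbers the tiered pipeline is cheaper because the small model's pass over all passages costs less than the large-model tokens it removes. The saving shrinks as the price gap narrows or as the filter keeps a larger share of passages.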

The general principle is that RAG architectures should be designed in terms of decision granularity rather than uniform model deployment. The retrieval pipeline contains several distinct sub-decisions, and matching each to an appropriately sized model produces both cheaper and better answers. That is a Pareto improvement that uniform RAG misses because it treats retrieval and generation as a single coupled act. This is the RAG-specific instance of the related note "Can small language models handle most agent tasks?": heterogeneous tiered architectures become the economic imperative whenever subtasks have different capability requirements.
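One way to make decision granularity explicit is a declarative stage-to-tier table. The sketch below is an assumption about how such a table might look: the stage names, the `Tier` enum, and `tier_for` are hypothetical, and the assignments simply mirror the split this note describes.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    LARGE = "large"


# Hypothetical stage names. Each sub-decision gets its own tier instead of
# one model being assigned to the whole pipeline.
STAGE_TIER = {
    "query_reformulation": Tier.SMALL,
    "passage_filtering": Tier.SMALL,
    "citation": Tier.SMALL,
    "final_synthesis": Tier.LARGE,
}


def tier_for(stage: str) -> Tier:
    """Return the tier for a stage.

    Unknown stages raise an error instead of silently defaulting to the
    expensive model.
    """
    try:
        return STAGE_TIER[stage]
    except KeyError:
        raise ValueError(f"no tier assigned for stage {stage!r}") from None
```

Keeping the mapping in data rather than in control flow means a stage can be moved between tiers, for example after an evaluation shows the small model is not sufficient for it, without touching the pipeline code.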


Source: 12 types of RAG

Original note title

hierarchical RAG splits filtering from generation across model tiers — small models prune and cite while large models only synthesize the final answer