What makes prerequisite filtering more reliable than semantic similarity matching?
This note explores why filtering by whether a document meets an actual requirement (a prerequisite, a cause, a logical constraint) tends to beat ranking by how 'close' it feels in embedding space. The corpus reads 'prerequisite filtering' as the broader family of structure-, cause-, and rationale-checking methods that semantic similarity can't replicate.
The short version the corpus keeps circling back to: semantic similarity measures *association*, not *relevance*. Embeddings reward text that looks topically related, but 'looks related' and 'is the thing you need' come apart constantly, and a prerequisite check asks the second question directly. The note "Where do retrieval systems fail and why?" names this as one of three structural failures of RAG, not a tuning problem: embedding similarity tracks association (roughly, what tends to co-occur), so it can't tell you whether a passage meets the demand of the task.
The sharpest demonstration is causal. "Why do queries and their causes seem semantically different?" shows that when a student asks about 'projection' after a lecture, the semantically closest passage is the one full of the phrase 'projection matrix', which is exactly *not* the statement that triggered the confusion. The cause and the surface match diverge. A prerequisite check ('what did this query actually depend on?') filters correctly where similarity confidently retrieves the wrong segment. The same gap shows up in evidence selection: in "Can rationale-driven selection beat similarity re-ranking for evidence?", an LLM generates a *rationale* (a reason a chunk should be included), and the method reaches 33% better accuracy than similarity re-ranking with 50% fewer chunks. The rationale is a prerequisite test; similarity is a vibe.
The reason prerequisite checks are more *reliable* (not just more accurate on average) is that similarity fails on structural near-misses: documents that share all the surface tokens but violate the constraint that matters. "Can verification separate structural near-misses from topical matches?" makes this an explicit two-stage design: cheap cosine recall first, then a learned verifier that reads full token-to-token interaction patterns to *reject* the near-misses that compressed vectors wave through. Verification is a separate task downstream of similarity precisely because similarity cannot do it. And "Can long-context LLMs replace retrieval-augmented generation systems?" shows the ceiling from the other side: stuff everything into a long context and the model still can't execute relational queries that need a join, a hard structural prerequisite that no amount of semantic proximity satisfies.
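The two-stage shape described above can be sketched in a few lines. This is a toy illustration, not the pipeline from the source note: the bag-of-words `embed`, the `facts` metadata on each document, and the rule-based `verify` are all assumptions standing in for a real encoder and a learned verifier. It only shows the division of labour, with similarity proposing candidates and an explicit check rejecting the near-miss.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recall(query, docs, k):
    """Stage 1: cheap similarity recall, returns the k closest documents."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d["text"])), reverse=True)[:k]

def verify(doc, required):
    """Stage 2: hypothetical prerequisite check; every required fact must match."""
    return all(doc["facts"].get(key) == value for key, value in required.items())

docs = [
    # Satisfies the requirement but is phrased differently from the query.
    {"id": "a", "text": "money back terms for yearly subscriptions bought in 2023",
     "facts": {"plan": "annual", "year": 2023}},
    # Structural near-miss: shares the surface tokens, wrong year.
    {"id": "b", "text": "refund policy for annual plans purchased in 2024",
     "facts": {"plan": "annual", "year": 2024}},
    {"id": "c", "text": "shipping times for hardware orders",
     "facts": {}},
]

query = "refund policy annual plans 2023"
candidates = recall(query, docs, k=2)
accepted = [d["id"] for d in candidates if verify(d, {"plan": "annual", "year": 2023})]

print([d["id"] for d in candidates])  # similarity order puts the near-miss first
print(accepted)                       # only the document that passes the check
```

In this toy corpus the near-miss outranks the correct document on similarity, so no score cutoff could keep one and drop the other; the explicit check is what separates them.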
The lateral payoff is that 'prerequisite filtering' is really one move applied across very different problems. You can route by what structure the task needs ("Can routing queries to task-matched structures improve RAG reasoning?" picks among tables, graphs, algorithms, catalogues, and chunks by query demand), or even defend against poisoning by flagging documents whose similarity *collapses abnormally* under token masking ("Can we defend RAG systems from corpus poisoning without retraining?"). The thread running through all of them: a constraint you can check has a defined failure mode and a defined pass condition. Similarity only has a gradient, and on the cases where it's confidently wrong, no threshold rescues it: if the near-miss outscores the right document, any cutoff that admits the right one admits the near-miss too. That's the thing worth taking away: similarity degrades gracefully into plausible-looking nonsense, while a prerequisite either holds or it doesn't, which is exactly what makes it trustworthy.
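To make 'similarity collapses abnormally under token masking' concrete, here is a minimal sketch of one way such a check could look. It is an illustration of the idea only and not the RAGMask method itself: the bag-of-words vectors, the `max_collapse` helper, the choice to mask every occurrence of a token, and the 0.8 threshold are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def max_collapse(query, doc):
    """Largest relative drop in query similarity when one token type is masked out."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    base = cosine(q, d)
    if base == 0.0:
        return 0.0
    worst = 0.0
    for token in d:
        # Mask (remove) every occurrence of this token and re-score.
        masked = Counter({t: c for t, c in d.items() if t != token})
        worst = max(worst, 1.0 - cosine(q, masked) / base)
    return worst

query = "reset router password"
corpus = {
    "benign": "to reset the router password open the admin page",
    # Similarity rests entirely on one repeated token.
    "poisoned": "password password password password visit evil site",
}

THRESHOLD = 0.8  # assumed cutoff, chosen only for this toy example
flagged = [name for name, text in corpus.items() if max_collapse(query, text) >= THRESHOLD]
print(flagged)
```

Both documents score about the same against the query, so raw similarity cannot tell them apart. The benign one loses only part of its score when any single token is masked, while the other loses all of it, and that difference is a condition you can check.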
Sources 7 notes
RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.
Backtracing—finding what caused a query—diverges from semantic similarity especially in conversation and lecture domains. Students ask about projection after hearing a specific statement, but the semantically closest passage discusses projection matrices instead, showing that surface similarity misses the actual cause.
METEORA uses LLM-generated rationales with flagging instructions to select evidence, achieving 33% better accuracy with 50% fewer chunks than similarity re-ranking across legal, financial, and academic domains. The method also improves adversarial robustness substantially.
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load theory and cognitive fit theory from cognitive science.
RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.