Can we defend RAG systems from corpus poisoning without retraining?
Explores whether retrieval-time defenses can catch and block poisoned documents before they reach the generator, without expensive retraining cycles. Matters because corpus updates outpace model retraining in production RAG systems.
RAG poisoning attacks insert malicious documents into the retrieval corpus so that they are retrieved for matching queries and steer generation toward attacker-preferred outputs. Existing defenses typically require retraining the retriever or the generator, which is expensive and slow to deploy. RAGPart and RAGMask propose two lightweight defenses that operate at retrieval time without modifying the generation model.
RAGPart exploits a structural property of dense retrievers: they learn discriminative patterns from how the training data is partitioned, which means malicious documents inserted into one partition have predictably limited influence on retrieval from queries that match a different partition. By configuring partitions deliberately, the system bounds how far any single poisoned document can propagate. RAGMask takes a different angle: it masks tokens in candidate documents and watches for abnormal similarity shifts. Genuine documents are robust to token masking — their similarity scores degrade smoothly — while poisoned documents that rely on specific trigger tokens show sudden similarity collapse, which serves as a detection signal.
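The masking idea behind RAGMask can be sketched with a toy bag-of-words retriever: mask one token at a time and watch how much the query-document similarity drops. All function names and the threshold here are illustrative assumptions, not RAGMask's actual implementation; a real system would mask tokens against a dense retriever's embeddings.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real retriever uses dense vectors."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def max_masking_drop(query, doc):
    """Mask one token at a time; return the largest relative similarity drop."""
    tokens = doc.split()
    base = cosine(embed(query), embed(doc))
    if base == 0.0:
        return 0.0
    worst = 0.0
    for i in range(len(tokens)):
        masked = " ".join(tokens[:i] + tokens[i + 1:])
        drop = (base - cosine(embed(query), embed(masked))) / base
        worst = max(worst, drop)
    return worst

def looks_poisoned(query, doc, threshold=0.5):
    """Genuine documents degrade smoothly under masking; documents whose
    relevance hinges on a single trigger token collapse past the threshold.
    The 0.5 threshold is an assumption for illustration only."""
    return max_masking_drop(query, doc) > threshold
```

A document that matches the query through several content words survives any single mask with a modest score drop, while one riding on a lone trigger token loses essentially all similarity when that token is removed, which is the detection signal.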
The architectural significance is that defense need not be coupled to training. Both methods sit at the retrieval layer and treat the generator as an untrusted black box that must be protected from upstream corruption. This separation matters operationally because retrieval corpora update faster than retrievers can be retrained, so defenses that require retraining are always behind the threat. The threat surface is real and severe: "How vulnerable is GraphRAG to tiny text manipulations?" shows that even minimal corpus modifications can devastate accuracy in graph-structured RAG.
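The partition-bounding property RAGPart relies on can be illustrated with a toy routed index. The class name and the routing rule below are hypothetical, not RAGPart's actual partitioning scheme; a production system would partition by topic or embedding cluster.

```python
class PartitionedIndex:
    """Toy partitioned retrieval index. A poisoned document can only
    surface for queries routed to its own partition, bounding how far
    a single corrupted entry can propagate."""

    def __init__(self, n_partitions):
        self.n = n_partitions
        self.parts = [[] for _ in range(n_partitions)]

    def _route(self, text):
        # Hypothetical deterministic routing key (first-word length);
        # real systems would route by topic or cluster assignment.
        return len(text.split()[0]) % self.n

    def add(self, doc):
        self.parts[self._route(doc)].append(doc)

    def retrieve(self, query):
        # Search only the query's own partition.
        return list(self.parts[self._route(query)])
```

The design point is the bound itself: however the routing is configured, a document inserted into one partition is invisible to queries routed elsewhere, so an attacker must poison every partition to achieve corpus-wide influence.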
Source: 12 types of RAG
Related concepts in this collection
- How vulnerable is GraphRAG to tiny text manipulations?
  GraphRAG converts raw text into knowledge graphs for question answering. This explores whether adversaries can degrade accuracy with minimal edits to source documents, and what makes the system susceptible.
  extends: documents the threat severity that motivates these defenses; together they form an attack/defense pair on the same retrieval-corpus surface
- Can RAG systems safely learn from their own generated answers?
  Explores whether retrieval-augmented generation can feed its outputs back into the corpus without corrupting knowledge with hallucinations. The core problem: how to prevent feedback loops from compounding errors.
  extends: bidirectional RAG opens a write surface that magnifies the poisoning attack vector; partition-aware retrieval and token-masking detection are exactly the kind of upstream defenses such systems will need
- Can one compromised agent corrupt an entire multi-agent network?
  Explores whether a single biased agent can spread behavioral corruption through ordinary messages to downstream agents without any direct adversarial access. Matters because it reveals a previously unknown vulnerability in how multi-agent systems communicate.
  extends: same lesson — defense at the message/retrieval layer beats trying to harden the generator; both attacks slip through ordinary content channels
Original note title: RAG corpus poisoning has lightweight defenses without retraining — partition-aware retrieval and token-masking similarity shifts catch attacks the generator never sees