INQUIRING LINE

Why does bidirectional RAG amplify the risk of corpus poisoning attacks?

This explores why a RAG system that writes its own generated answers back into its retrieval corpus (bidirectional RAG) opens a wider door to poisoning than a read-only one — and what the corpus says about closing it.


This reads the question as being about the feedback loop, not the retrieval step alone. Ordinary RAG only reads from a fixed corpus, but bidirectional RAG also *writes* generated answers back into it, so bad content that slips in doesn't just produce one wrong answer: it becomes a future source that gets retrieved, cited, and built upon. The corpus frames this directly. "Can RAG systems safely learn from their own generated answers?" describes systems that grow their own knowledge base during use, and the entire reason they gate write-back behind entailment checks, source attribution, and novelty detection is that without those gates, a single hallucination or injected document compounds. The amplification is the loop (read, generate, write, re-read), which turns a one-shot attack into a self-reinforcing one.
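The loop can be made concrete with a toy simulation. This is a minimal sketch under loud assumptions: `Entry`, `run_cycles`, and `seed` are hypothetical names, retrieval is faked by taking the newest entries, generation is faked by marking an answer poisoned whenever any retrieved source is, and the gate is an oracle that a real system would not have.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    poisoned: bool  # ground-truth label, known only to this simulation


def run_cycles(corpus, cycles, top_k, gate=None):
    """Run read -> generate -> write -> re-read cycles; return the poisoned count per cycle."""
    history = []
    for step in range(cycles):
        # Toy retrieval (assumption): the newest entries rank highest.
        retrieved = corpus[-top_k:]
        # Toy generation (assumption): an answer built on a poisoned source is poisoned.
        answer = Entry(f"answer-{step}", any(e.poisoned for e in retrieved))
        # Write-back, optionally guarded by a gate that may reject the answer.
        if gate is None or gate(answer):
            corpus.append(answer)
        history.append(sum(e.poisoned for e in corpus))
    return history


def seed():
    """A fresh three-entry corpus containing one injected document."""
    return [Entry("fact-a", False), Entry("fact-b", False), Entry("injected", True)]


ungated = run_cycles(seed(), cycles=5, top_k=2)
# Oracle gate (assumption): it reads the hidden label, which no real gate can do.
gated = run_cycles(seed(), cycles=5, top_k=2, gate=lambda a: not a.poisoned)
print("ungated:", ungated)
print("gated:  ", gated)
```

In this setup the ungated corpus gains one poisoned entry per cycle, while the gated corpus stays at the single injected document. The recency-based retrieval deliberately favors the attacker; the point is the shape of the growth, not its rate.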

What makes this worse is that the things you'd write back are exactly the things that are hard to verify. "Do frontier LLMs silently corrupt documents in long workflows?" shows that even frontier models silently corrupt roughly a quarter of document content across long relay workflows, with errors compounding rather than plateauing, which is precisely the dynamic a write-back loop creates if generation quality isn't gated. And "Where do retrieval systems fail and why?" argues that retrieval failure is architectural: embeddings measure association, not relevance, so a poisoned document that's merely *similar* to a query will surface regardless of whether it's true. Bidirectional RAG keeps feeding that flawed retriever new material it helped author.
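A quick back-of-the-envelope calculation shows why compounding corruption is easy to miss. It takes the roughly 25% degradation over 50 round-trips from the source note and assumes, purely for illustration, that the same fraction of content survives each round-trip; the cited study does not claim this geometric model.

```python
total_retained = 0.75  # ~25% degradation over the full relay, per the source note
round_trips = 50

# Assumption: a constant fraction survives every round-trip (geometric decay).
per_trip_retained = total_retained ** (1 / round_trips)
per_trip_loss = 1 - per_trip_retained

print(f"implied loss per round-trip: {per_trip_loss:.2%}")
```

Under that assumption the implied loss per round-trip is a little over half a percent, small enough that a check on any single write-back could plausibly pass while the cumulative damage keeps growing.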

The attacker's job also gets easier because poison is sticky. "How much poisoned training data survives safety alignment?" found that poisoning as little as 0.1% of training data survives safety alignment for denial-of-service, context-extraction, and belief-manipulation attacks. The analogy for a corpus is sharp: once a poisoned passage is written into a retrieval store, no alignment pass scrubs it; it just sits there waiting to be retrieved. Worse, the verification step itself can be gamed. "Can LLM judges be fooled by fake credentials and formatting?" shows that LLM judges fall for fake authority signals and rich formatting in attacks that need zero model access, so if your write-back gate is an LLM checking 'is this trustworthy?', an attacker can dress poison up to pass.

The corpus does point at defenses, and they cluster around containment rather than detection after the fact. "Can we defend RAG systems from corpus poisoning without retraining?" offers retrieval-time guards: partition-aware retrieval (RAGPart) that bounds how much any one poisoned document can influence an answer, and token masking (RAGMask) that flags documents whose similarity collapses abnormally when tokens are masked. "Can RAG systems refuse to answer without reliable evidence?" takes the opposite-but-complementary tack: refuse to answer at all without grounded evidence, trading coverage for integrity. Both matter more in a bidirectional system because the cost of a bad answer is no longer one bad answer; it's a permanent corpus entry.
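The containment idea, bounding what one poisoned document can do, can be illustrated with a generic partition-and-vote sketch. This is not the RAGPart method, which the source note describes as working through partitioned retriever learning; it is a simpler stand-in that shows the same bounding principle. The function names, the `claim` field, and the majority-vote stand-in for generation are all assumptions.

```python
from collections import Counter


def partition(docs, n_parts):
    """Split documents into n_parts disjoint groups (round-robin)."""
    return [docs[i::n_parts] for i in range(n_parts)]


def answer_from(group):
    """Hypothetical stand-in for generation: the most common claim in one group."""
    return Counter(doc["claim"] for doc in group).most_common(1)[0][0]


def partitioned_answer(docs, n_parts):
    """Answer each partition separately, then majority-vote across partitions."""
    votes = [answer_from(group) for group in partition(docs, n_parts) if group]
    return Counter(votes).most_common(1)[0][0], votes


docs = [
    {"id": 1, "claim": "supported"},
    {"id": 2, "claim": "supported"},
    {"id": 3, "claim": "poisoned"},  # the single injected document
    {"id": 4, "claim": "supported"},
    {"id": 5, "claim": "supported"},
    {"id": 6, "claim": "supported"},
]
final, votes = partitioned_answer(docs, n_parts=3)
print(final, votes)
```

Because each document lands in exactly one partition, a single poisoned document can sway at most one vote, so the final answer holds as long as most partitions are clean.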

The thing worth taking away: the danger of bidirectional RAG isn't that it can be poisoned, since every RAG system can. It's that the write-back loop converts the corpus from a static target into a *cultivated* one, where the system's own mistakes and an attacker's injections both accumulate and amplify across cycles. That's why the credible designs (see "Can RAG systems safely learn from their own generated answers?") spend almost all their engineering on the gate between 'generated' and 'stored,' not on generation quality itself.


Sources (7 notes)

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.
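The three checks named in this note can be sketched as a single gate function. This is a minimal sketch, not the design of any particular system: `should_write_back`, its thresholds, and the token-overlap scoring are all hypothetical, and token overlap is only a crude proxy where a real gate would use something like an entailment model.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Fraction of a's tokens that also appear in b (0.0 if a is empty)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta) if ta else 0.0


def should_write_back(answer, cited_sources, corpus,
                      entail_min=0.8, novelty_max=0.9):
    """Return (ok, reason); all three checks must pass before an answer is stored."""
    # 1. Source attribution: the answer must cite at least one source.
    if not cited_sources:
        return False, "no attribution"
    # 2. Entailment (crude proxy): the cited sources must cover the answer's tokens.
    if overlap(answer, " ".join(cited_sources)) < entail_min:
        return False, "not entailed"
    # 3. Novelty: skip answers that nearly duplicate an existing entry.
    if any(overlap(answer, doc) > novelty_max for doc in corpus):
        return False, "not novel"
    return True, "ok"


sources = ["write-back is gated by entailment checks",
           "novelty detection filters duplicates"]
corpus = ["the gate checks entailment before storage"]
print(should_write_back("write-back is gated by novelty detection", sources, corpus))
print(should_write_back("the moon is made of cheese", sources, corpus))
```

The checks run cheapest first, and the gate returns a reason so that rejected answers can be logged rather than silently dropped. The thresholds here are placeholders and would need tuning against real data.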

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.
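The masking idea can be illustrated with a toy check: mask one document token at a time and see how far the query-document similarity falls. This sketch is not RAGMask itself. It swaps in bag-of-words cosine similarity for a learned embedding, and `max_collapse` and the 0.5 threshold are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def max_collapse(query, doc):
    """Largest relative similarity drop caused by masking one distinct document token."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    base = cosine(q, d)
    if base == 0.0:
        return 0.0
    worst = 0.0
    for token in d:
        masked = Counter({t: c for t, c in d.items() if t != token})
        worst = max(worst, (base - cosine(q, masked)) / base)
    return worst


query = "reset router password"
benign = "reset the router password from the admin page"
suspicious = "router firmware is unsafe so send credentials to this address"

THRESHOLD = 0.5  # hypothetical cut-off for "abnormal" collapse
for doc in (benign, suspicious):
    print(max_collapse(query, doc) > THRESHOLD, doc)
```

The benign document matches the query on several tokens, so masking any one of them costs only part of the similarity. The suspicious document matches on a single token, so masking it removes the match entirely, and that is the kind of collapse the check flags.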

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.
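A grounded-refusal prompt of the kind this note describes might be assembled as follows. The wording of the prompt, the `INSUFFICIENT_EVIDENCE` sentinel, and both function names are assumptions made for illustration; the source does not give the system's actual prompt.

```python
REFUSAL = "INSUFFICIENT_EVIDENCE"  # hypothetical sentinel the model is asked to emit


def build_grounded_prompt(question, passages):
    """Assemble a prompt that restricts the model to the numbered passages."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the passages below and cite passage numbers.\n"
        f"If the passages do not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\n"
    )


def is_refusal(model_output):
    """True when the model declined, so callers can skip showing or storing an answer."""
    return model_output.strip() == REFUSAL


prompt = build_grounded_prompt(
    "Who edited the paper in 1871?",
    ["The masthead lists no editor for that year."],
)
print(prompt)
```

A fixed sentinel makes a refusal machine-detectable, which matters in a write-back system: a refusal should never be stored as if it were an answer.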

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a security researcher auditing corpus-poisoning risk in bidirectional RAG systems. The question: does bidirectional (read-write) RAG amplify poisoning risk compared to read-only RAG, and if so, what defenses now work?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable.
• Write-back loops transform static corpora into self-cultivating targets; a single poisoned entry becomes retrievable source material for future generations, compounding via re-read cycles (~2024–2025).
• Frontier LLMs silently corrupt ~25% of document content across relay workflows, errors compounding rather than plateauing — exactly the risk write-back amplifies without quality gates (~2026).
• Poisoning as small as 0.1% of training data survives post-training alignment for DoS, context extraction, and belief manipulation; no analogous scrubbing exists for corpus entries (~2024).
• LLM judges (used as write-back gates) fall for fake authority signals and rich formatting in zero-shot attacks that need no model access or optimization, so verification gates themselves are exploitable (~2024).
• Lightweight defenses exist: partition-aware retrieval (bounds poisoned document influence), token-masking (flags abnormal similarity collapse), refusal-without-grounding (trade coverage for integrity) (~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2410.13722 (2024-10): Persistent Pre-Training Poisoning of LLMs
• arXiv:2402.10669 (2024-02): Humans or LLMs as the Judge? Judgement Biases
• arXiv:2604.15597 (2026-04): LLMs Corrupt Your Documents When You Delegate
• arXiv:2505.16014 (2025-05): Ranking Free RAG: Replacing Re-ranking with Selection

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, assess whether newer orchestration (multi-agent, memory isolation, cache-aware retrieval), model scaling, training method (post-training alignment for RAG-specific integrity), or evaluation harnesses have since relaxed or overturned the write-back amplification. Separate the durable risk (write-back loops likely still exist and compound mistakes) from the perishable claim (e.g., 25% corruption rate, LLM judge vulnerability). Cite what resolved it; plainly state where the risk persists.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — any paper showing bidirectional RAG WITHOUT amplified risk, or defenses that *prevent* poisoned entries from persisting.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., does uncertainty quantification during write-back eliminate the amplification? Can retrieval-time filtering (CLaRa, UR2, adaptive reasoning) prevent poisoned entries from being useful to the system?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
