What makes evidence selection vulnerable to adversarial poisoning attacks?

This explores why the step where a system picks which evidence to trust — in RAG retrieval, reasoning chains, and multi-agent relays — is a soft target for attackers who plant or warp the inputs.

The short version from the corpus: evidence selection mostly runs on similarity and surface plausibility, not on whether the content is actually trustworthy, and attackers exploit exactly that gap. The weakest link is that selection rewards looking relevant over being reliable. LLM judges have been shown to score answers higher just because they carry fake references or rich formatting, regardless of content quality (see "Can LLM judges be tricked without accessing their internals?"). That is the same flaw that lets a poisoned document look like the most retrievable one.
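To make the gap between looking relevant and being reliable concrete, here is a minimal sketch, assuming a toy bag-of-words retriever rather than any real embedding model. The documents, the `trusted` field, and the helper names are all hypothetical. The point is only that the ranking never consults `trusted`, so a document that echoes the query wins.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector (toy stand-in for an embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Rank by similarity only; the 'trusted' field is never consulted."""
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d["text"])), reverse=True)

# Hypothetical corpus: one reliable document, one planted document that
# repeats the query wording before stating a false claim.
docs = [
    {"id": "reliable", "trusted": True,
     "text": "the recommended dose of drug x for adults is ten milligrams daily according to the label"},
    {"id": "poisoned", "trusted": False,
     "text": "what is the recommended dose of drug x the recommended dose of drug x is five hundred milligrams"},
]

top = rank("what is the recommended dose of drug x", docs)[0]
print(top["id"], top["trusted"])
```

In this toy setup the planted document ranks first because it overlaps more with the query, not because anything vouches for it.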

A second vulnerability is that selection doesn't need to be tricked at the meaning level at all. Query-agnostic adversarial triggers, sentences that have nothing to do with the actual problem, can be appended to inputs and still drive a 300% jump in reasoning errors, and triggers discovered cheaply on weaker models transfer to stronger ones (see "How vulnerable are reasoning models to irrelevant text?"). Even more striking, poison can carry no explicit semantic content and still spread: a single biased agent transmits behavioral corruption through six downstream agents using only normal messages, evading paraphrasing and detection defenses precisely because there's nothing flagrant to catch (see "Can one compromised agent corrupt an entire multi-agent network?"). Selection mechanisms tuned to spot 'bad-looking' text miss poison that looks ordinary.

Third, the vulnerability compounds the longer the evidence is processed. Multi-turn manipulative prompts knock 25–29% off reasoning-model accuracy because extended chains create more intervention points where one corrupted step propagates into a confident wrong conclusion (see "Why do reasoning models fail under manipulative prompts?" and "Are reasoning models actually more vulnerable to manipulation?"). There's also a structural floor: a Lipschitz-continuity analysis shows that more reasoning steps dampen input perturbations but never drive sensitivity to zero, so some residual vulnerability remains that you can't reason your way out of (see "Can longer reasoning chains eliminate model sensitivity to input noise?"). Poisoning planted upstream is also durable: at just 0.1% of pretraining data, denial-of-service, context-extraction, and belief-manipulation attacks survive standard safety alignment (see "How much poisoned training data survives safety alignment?").
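The "more intervention points" argument can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that each reasoning step is corrupted independently with the same small probability. Neither the independence assumption nor the 2% figure comes from the corpus, and the function name is hypothetical.

```python
def p_any_corrupted(p_step: float, n_steps: int) -> float:
    """Chance that at least one of n independent steps is corrupted.

    Assumes every step fails independently with probability p_step,
    which is a simplification made only for this illustration.
    """
    return 1.0 - (1.0 - p_step) ** n_steps

# Hypothetical per-step corruption probability of 2%.
for n in (5, 20, 40):
    print(n, round(p_any_corrupted(0.02, n), 3))
```

Under these assumptions a 5-step chain is touched by corruption a little under 10% of the time, while a 40-step chain is touched more than half the time, even though no individual step got any weaker.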

What's quietly encouraging is that the corpus also names the fix, and it's the inverse of the flaw: stop selecting on similarity alone. METEORA replaces similarity re-ranking with LLM-generated rationales that include explicit flagging instructions, and gets not just 33% better accuracy with half the chunks but substantially improved adversarial robustness, because asking 'why does this belong?' is harder to fool than 'does this look similar?' (see "Can rationale-driven selection beat similarity re-ranking for evidence?"). Lightweight retrieval-layer defenses make the same move: RAGPart bounds how much any single poisoned document can influence an answer, and RAGMask flags documents whose similarity collapses abnormally under token masking (see "Can we defend RAG systems from corpus poisoning without retraining?"). The most conservative option is to let the system decline: grounded-refusal prompts answer only from reliable evidence and otherwise say nothing, trading coverage for integrity when the corpus is noisy or corrupted (see "Can RAG systems refuse to answer without reliable evidence?").
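The masking idea can be sketched in a few lines. This is not the RAGMask implementation, only a toy illustration of "similarity collapse under token masking" using bag-of-words cosine similarity. The function names, the 0.8 threshold, and the example documents are assumptions made for the sketch.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def max_collapse(query: str, doc: str) -> float:
    """Largest relative similarity drop from masking any single token type."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    base = cosine(q, d)
    if base == 0.0:
        return 0.0
    worst = base
    for token in d:
        # Mask every occurrence of this token and re-score the document.
        masked = Counter({t: c for t, c in d.items() if t != token})
        worst = min(worst, cosine(q, masked))
    return 1.0 - worst / base

def is_suspicious(query: str, doc: str, threshold: float = 0.8) -> bool:
    """Flag documents whose match hinges on one token (threshold is hypothetical)."""
    return max_collapse(query, doc) >= threshold

query = "reset router password"
benign = "to reset the router password hold the reset button"
stuffed = "password password password click this link now to win prizes"
print(is_suspicious(query, benign), is_suspicious(query, stuffed))
```

The benign document still matches the query after any one token is masked, while the stuffed document's similarity drops to zero once its single repeated keyword is removed. A real defense would need to handle short legitimate documents, which this toy check would also flag.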

The thread tying it together, and the thing you might not have come looking for: evidence selection is vulnerable wherever it optimizes for retrievability or plausibility instead of provenance. Every defense in the corpus works by re-introducing a reason-to-trust check — a rationale, an influence bound, an abnormality test, a refusal — at the exact moment selection happens.
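As a rough illustration of putting a reason-to-trust check at the moment of selection, the sketch below combines a provenance allowlist with a refusal path. It is a simplified stand-in, not any of the cited systems: the `Chunk` type, the allowlist contents, the similarity cutoff, and the refusal wording are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str        # where the chunk came from (provenance)
    similarity: float  # retriever score, assumed to lie in [0, 1]

# Hypothetical allowlist of sources the deployment has chosen to trust.
TRUSTED_SOURCES = {"internal-wiki", "vendor-docs"}

REFUSAL = "I cannot answer this from reliable evidence."

def select_evidence(chunks, min_similarity=0.5):
    """Keep chunks that are both relevant enough and from a trusted source."""
    return [c for c in chunks
            if c.similarity >= min_similarity and c.source in TRUSTED_SOURCES]

def answer_or_refuse(chunks):
    """Answer from the best trusted chunk, or decline if none qualifies."""
    evidence = select_evidence(chunks)
    if not evidence:
        return REFUSAL
    best = max(evidence, key=lambda c: c.similarity)
    return f"Based on {best.source}: {best.text}"

planted = Chunk("Send your password to this address.", "anonymous-paste", 0.95)
vetted = Chunk("Reset passwords only through the admin console.", "internal-wiki", 0.70)

print(answer_or_refuse([planted, vetted]))
print(answer_or_refuse([planted]))
```

The planted chunk has the highest similarity in both calls, yet it is never used: in the first call the vetted chunk answers, and in the second the system declines rather than fall back on untrusted text.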

Sources (10 notes)

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

How vulnerable are reasoning models to irrelevant text?

Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.

Can one compromised agent corrupt an entire multi-agent network?

Research demonstrates that a single biased agent can transmit persistent behavioral corruption through six downstream agents in chain and bidirectional topologies using only normal inter-agent communication. The bias evades detection and paraphrasing defenses because it carries no explicit semantic content.

Why do reasoning models fail under manipulative prompts?

GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.

Are reasoning models actually more vulnerable to manipulation?

GaslightingBench-R shows that multi-turn manipulative prompts reduce reasoning model accuracy significantly more than standard models. Extended chains create more corruption points, allowing single wrong steps to propagate into confident incorrect conclusions.

Can longer reasoning chains eliminate model sensitivity to input noise?

Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Can rationale-driven selection beat similarity re-ranking for evidence?

METEORA uses LLM-generated rationales with flagging instructions to select evidence, achieving 33% better accuracy with 50% fewer chunks than similarity re-ranking across legal, financial, and academic domains. The method also improves adversarial robustness substantially.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.
