How do token-masking patterns distinguish genuine documents from poisoned ones?

This explores RAGMask-style defenses, where masking tokens in a document and watching how its retrieval similarity reacts reveals whether the document earned its match honestly or was engineered to win.

This is really a question about *brittleness as a tell*. The core idea comes from retrieval-time RAG defenses (see "Can we defend RAG systems from corpus poisoning without retraining?"): a genuine document is relevant because its meaning is spread across many words, so if you randomly mask some tokens, its similarity to the query degrades gradually. A poisoned document is different. It was optimized to rank for a query by stuffing in a few adversarial trigger tokens, so its high similarity is balanced on a knife's edge. Mask the handful of tokens that carry the match and the similarity collapses abnormally fast. That sudden collapse, not the document's surface content, is the signature RAGMask flags. The companion technique, RAGPart's partition-aware retrieval, caps how much any single planted document can sway the answer in the first place.
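The collapse test can be sketched in a few lines. This is a toy illustration, not RAGMask's actual procedure. It assumes a bag-of-words cosine as a stand-in for the retriever, and it masks every contiguous window exhaustively instead of sampling masks. The function names, the window size, and the example texts are all hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_masked_drop(query: str, doc: str, window: int = 2) -> float:
    """Largest relative similarity drop caused by masking any one window.

    Assumption: a toy bag-of-words retriever stands in for the real one.
    """
    q = Counter(query.lower().split())
    tokens = doc.lower().split()
    base = cosine(q, Counter(tokens))
    if base == 0.0:
        return 0.0
    worst = 0.0
    for start in range(len(tokens) - window + 1):
        # Drop `window` adjacent tokens and re-score the remainder.
        masked = tokens[:start] + tokens[start + window:]
        drop = (base - cosine(q, Counter(masked))) / base
        worst = max(worst, drop)
    return worst


query = "reset router password admin"
genuine = ("reset the router by opening the admin page then "
           "enter a new password and reset again")
poisoned = ("reset router password admin click here for free "
            "prizes and amazing deals today")

genuine_drop = max_masked_drop(query, genuine)
poisoned_drop = max_masked_drop(query, poisoned)
```

In this toy setup the genuine text loses only a small fraction of its similarity under any single mask, while the stuffed text loses close to half when two adjacent trigger tokens are removed. A real detector would need a threshold calibrated against the actual retriever's behavior on known-clean documents.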

What makes this interesting is that it's an instance of a much broader pattern in the corpus: adversarial artifacts leave *behavioral* fingerprints even when their content looks clean. The same intuition shows up in verification work that learns over full token-to-token similarity maps rather than compressed vectors. There, a small transformer reading the interaction pattern catches "structural near-misses" that pooled-cosine scoring waves through (see "Can verification separate structural near-misses from topical matches?"). In both cases the defense works by refusing to trust a single aggregate similarity number and instead probing how that similarity is *built*. Masking is just the cheapest way to ask: is this relevance robust, or load-bearing on a few tokens?
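To make the contrast between an aggregate score and an interaction pattern concrete, here is a minimal sketch. It assumes exact-match token similarity and swaps the learned transformer verifier for a hand-written word-order check, so it only illustrates why a pooled score can miss structure. All function names are hypothetical.

```python
import math
from collections import Counter


def pooled_cosine(q_tokens, d_tokens):
    """Order-blind aggregate score: cosine over bag-of-words counts."""
    q, d = Counter(q_tokens), Counter(d_tokens)
    dot = sum(count * d[token] for token, count in q.items())
    norm = math.sqrt(sum(c * c for c in q.values())) * \
        math.sqrt(sum(c * c for c in d.values()))
    return dot / norm if norm else 0.0


def similarity_map(q_tokens, d_tokens):
    """Token-to-token map: one row per query token, one column per doc token."""
    return [[1.0 if qt == dt else 0.0 for dt in d_tokens] for qt in q_tokens]


def order_agreement(sim_map):
    """Fraction of query-token pairs whose best doc matches keep their order.

    Assumption: a crude stand-in for a learned verifier over the map.
    """
    best = [max(range(len(row)), key=row.__getitem__) for row in sim_map]
    pairs = [(i, j) for i in range(len(best)) for j in range(i + 1, len(best))]
    if not pairs:
        return 1.0
    return sum(1 for i, j in pairs if best[i] < best[j]) / len(pairs)


query = "dog bites man".split()
match = "the dog bites the man".split()
near_miss = "the man bites the dog".split()

same_pooled = math.isclose(pooled_cosine(query, match),
                           pooled_cosine(query, near_miss))
match_order = order_agreement(similarity_map(query, match))
near_miss_order = order_agreement(similarity_map(query, near_miss))
```

Both documents receive the same pooled score because they contain the same words, yet their similarity maps differ: the matching document preserves the query's word order and the near-miss reverses it.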

The corpus also suggests why a retrieval-time tripwire matters so much. Poison introduced earlier, during pretraining, is stubborn: denial-of-service, context-extraction, and belief-manipulation attacks at just 0.1% of the data largely survive standard safety alignment (see "How much poisoned training data survives safety alignment?"). If you can't reliably scrub poison out of the model, catching it at the moment of retrieval becomes a frontline rather than a backstop. And it pairs naturally with the complementary strategy of constraining the generator: grounded-refusal systems simply decline to answer when the retrieved evidence is too noisy or untrustworthy (see "Can RAG systems refuse to answer without reliable evidence?"). Masking screens what comes in; grounded refusal limits the damage of whatever slips through.
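The two layers can be combined in a small gate. This is a hedged sketch only: the cited system enforces grounded refusal through its prompt, whereas this example substitutes a code-level check, and the brittleness scores, threshold, and function names are hypothetical.

```python
from typing import Callable, Sequence

REFUSAL = "I cannot answer from the available evidence."


def answer_with_guardrails(
    docs: Sequence[str],
    brittleness: Callable[[str], float],
    generate: Callable[[Sequence[str]], str],
    max_drop: float = 0.3,
    min_docs: int = 1,
) -> str:
    """Screen retrieved documents, then refuse if too little evidence remains.

    Assumptions: `brittleness` returns a masking-collapse score in [0, 1],
    and `generate` answers strictly from the documents it is given.
    """
    # Layer 1: drop documents whose similarity collapses under masking.
    trusted = [doc for doc in docs if brittleness(doc) <= max_drop]
    # Layer 2: decline instead of answering from thin or suspect evidence.
    if len(trusted) < min_docs:
        return REFUSAL
    return generate(trusted)


# Hypothetical precomputed scores, for illustration only.
scores = {"clean note": 0.12, "stuffed page": 0.46}
summarise = lambda kept: "answer based on: " + ", ".join(kept)

mixed = answer_with_guardrails(["clean note", "stuffed page"],
                               scores.__getitem__, summarise)
only_poison = answer_with_guardrails(["stuffed page"],
                                     scores.__getitem__, summarise)
```

Keeping the screen and the refusal as separate steps means a failure in one layer does not silently disable the other.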

The thing you might not have known you wanted to know: this "perturb it and watch the reaction" trick recurs as a general detection philosophy. You can distinguish types of LLM falsehood by how much an answer *varies when regenerated*: fabrication wobbles, while good-faith error stays stable (see "Can we distinguish types of LLM falsehood by regeneration patterns?"). You can catch AI-written or deceptive text through cheap, interpretable linguistic signatures rather than heavyweight models (see "Can NLP detect deception through distinct linguistic patterns?"). Token masking for poisoned documents is the retrieval-layer member of that family: instead of asking "is this content true?", it asks "does this thing behave the way honest things behave when you poke it?" and lets the brittleness of the manipulation give it away.
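The regeneration test has the same shape as the masking test: perturb, then measure the spread. A minimal sketch, assuming word-set Jaccard distance as the variation measure; that metric, the function names, and the example answers are illustrative choices, not part of the cited framework.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the word-set overlap between two answers."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def regeneration_variation(answers: list[str]) -> float:
    """Mean pairwise distance across regenerated answers to one prompt."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical regenerations of the same question.
stable = ["the capital is paris"] * 3
wobbly = [
    "it was published in 1987",
    "the author wrote it in 2003",
    "released by a press in 1994",
]

stable_score = regeneration_variation(stable)
wobbly_score = regeneration_variation(wobbly)
```

Low variation on its own does not say which stable category an answer falls into. Per the source note, both good-faith error and role-played deception show low variation, and separating them requires checking whether the answer changes with context.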

Sources (6 notes)

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.

Can verification separate structural near-misses from topical matches?

A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.

How much poisoned training data survives safety alignment?

Denial-of-service, context-extraction, and belief-manipulation attacks persist through standard safety alignment at 0.1% poisoning rates. Jailbreaking attacks, by contrast, are successfully suppressed, which contradicts sleeper-agent persistence hypotheses.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Can we distinguish types of LLM falsehood by regeneration patterns?

Shanahan's framework distinguishes fabrication (high variation), good-faith error (low variation, stable), and role-played deception (low variation, context-dependent) using behavioral tests alone. This avoids mentalistic language while enabling differential diagnosis for safety.

Can NLP detect deception through distinct linguistic patterns?

Research validates four complementary mechanisms of linguistic deception—distancing, cognitive load, reality monitoring, and verifiability avoidance—each with measurable NLP signatures including pronoun ratios, lexical complexity, concrete language use, and verifiable detail presence.
