Can adversarial paraphrasing defeat feature-based detection of LLM text?

This explores whether rewording AI-generated text (adversarial paraphrasing) can erase the stylistic 'fingerprints' that cheap, feature-based detectors rely on. The collection has no head-to-head 'paraphrase attack vs. detector' study, so it addresses the question sideways, through what those detectors actually key on and how LLMs behave under rewording. Those are the two halves you'd need to reason about such a study, and they pull in different directions.

The case for detectors being robust: the signal they catch is structural, not cosmetic. A detector using only interpretable linguistic features and argument-quality measures reached 99% accuracy at spotting LLM-generated counter-arguments on r/ChangeMyView, matching heavyweight neural models, because LLMs leave consistent tells: over-accommodation to the prompt and a 'textbook-quality' argument polish that humans rarely produce [Can simple linguistic features detect AI-written arguments?]. Style detection also saturates early and easily: a model as old as GPT-2 identifies authorship from style patterns alone with 95% accuracy [Can language models truly understand literary style?]. If the giveaway lives in deep argument shape and pattern-level style rather than in word choice, surface paraphrasing may not reach it.
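To make the structural-versus-cosmetic distinction concrete, here is a minimal sketch of the kind of interpretable feature extraction such a detector might build on. The three features and the `DISCOURSE_MARKERS` list are hypothetical choices made for illustration; they are not the feature set used in the cited study.

```python
import re

# Hypothetical marker list for illustration only; not taken from the cited study.
DISCOURSE_MARKERS = {"however", "moreover", "furthermore", "therefore", "additionally"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Split on runs of ., ! or ? and drop empty fragments."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def extract_features(text):
    """Return three coarse, interpretable style features for a text."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    if not tokens or not sentences:
        return {"mean_sentence_length": 0.0, "type_token_ratio": 0.0, "marker_rate": 0.0}
    n_tokens = len(tokens)
    return {
        # Average number of word tokens per sentence.
        "mean_sentence_length": n_tokens / len(sentences),
        # Share of distinct tokens among all tokens.
        "type_token_ratio": len(set(tokens)) / n_tokens,
        # Share of tokens that are explicit discourse markers.
        "marker_rate": sum(t in DISCOURSE_MARKERS for t in tokens) / n_tokens,
    }


original = "However, this is fine. Moreover, it works."
reworded = "However, this is acceptable. Moreover, it works."

# A one-word synonym swap leaves all three features unchanged.
print(extract_features(original) == extract_features(reworded))
```

In this toy setup, swapping a single word for a synonym changes none of the features, which is the sense in which word-level rewording may fail to reach pattern-level signals. A real detector would use far richer features and a trained classifier on top of them.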

But the corpus also suggests why paraphrasing is a double-edged sword. LLMs have a built-in pull toward high-frequency surface forms: given semantically equivalent options, they systematically prefer the textually common phrasing over rarer wordings [Do language models really understand meaning or just surface frequency?]. An LLM asked to paraphrase therefore drifts, by its own machinery, toward statistically typical language, which is itself a fingerprint. Worse, generation flows smoothly toward the training distribution rather than exploring genuinely different phrasings [Does LLM generation explore competing claims while producing text?]. So 'adversarial paraphrasing' performed by another LLM may just relocate the signature rather than remove it.
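One rough way to picture 'statistically typical language' as a measurable signal is the mean log relative frequency of a text's tokens under some reference count table. The sketch below assumes a tiny, made-up `REFERENCE_COUNTS` table and a hypothetical `unseen_count` fallback; a real measurement would use counts from a large corpus, and none of these numbers come from the cited notes.

```python
import math

# Made-up counts for illustration only; a real table would come from a large corpus.
REFERENCE_COUNTS = {
    "we": 1000,
    "to": 1000,
    "use": 900,
    "help": 800,
    "tools": 200,
    "utilize": 40,
    "facilitate": 30,
}


def mean_log_frequency(tokens, counts=REFERENCE_COUNTS, unseen_count=1):
    """Average log relative frequency of tokens; higher means more typical wording."""
    if not tokens:
        return 0.0
    total = sum(counts.values())
    # Tokens missing from the table fall back to a small assumed count.
    return sum(math.log(counts.get(t, unseen_count) / total) for t in tokens) / len(tokens)


common = ["we", "use", "tools"]
rare = ["we", "utilize", "tools"]

# The common wording scores as more typical than the rarer equivalent.
print(mean_log_frequency(common) > mean_log_frequency(rare))
```

Under these assumptions, a paraphraser that keeps choosing the common variant pushes this score up, so the rewording itself moves the text toward a detectable region instead of away from it.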

The more interesting angle the corpus opens: the attacks that work tend to exploit a detector's blind spots without touching content at all. Research on LLM judges shows they can be fooled by zero-shot attacks that need no model access, simply by adding fake authority signals and rich formatting. These biases are 'semantics-agnostic,' meaning they fire regardless of what the text actually says [Can LLM judges be fooled by fake credentials and formatting?] [Can LLM judges be tricked without accessing their internals?]. That reframes the question: the cheapest way past a feature-based detector might be not to paraphrase the prose but to manipulate the features it scores. Any detector keying on a fixed, interpretable feature set is only as strong as the assumption that attackers won't target those exact features.
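The idea of a 'semantics-agnostic' feature can be illustrated with a toy surface scorer. The function below is a hypothetical sketch, not the method of the cited research: it counts citation-like patterns, bullet lines, and bold spans, none of which depend on what the text argues.

```python
import re


def surface_features(text):
    """Count purely presentational cues that carry no information about the claim."""
    return {
        # Bracketed numbers like [1] or author-year strings like (Smith et al., 2021).
        "citation_like": len(re.findall(r"\[\d+\]|\(\w+ et al\., \d{4}\)", text)),
        # Lines that start with a dash or asterisk followed by whitespace and text.
        "bullet_lines": len(re.findall(r"^[ \t]*[-*][ \t]+\S", text, flags=re.MULTILINE)),
        # Markdown-style bold spans.
        "bold_spans": len(re.findall(r"\*\*[^*]+\*\*", text)),
    }


plain = "Remote work raises output."
decorated = plain + " [1]\n- **Key point** (Smith et al., 2021)"

print(surface_features(plain))
print(surface_features(decorated))
```

The claim in `decorated` is identical to the one in `plain`, yet every counter moves. Any scorer that weights counters like these can be shifted without rewording a single sentence of the argument, which is why a fixed, known feature set is itself an attack surface.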

The takeaway: the detectors in this collection win because LLMs fail to hide *deep* structural habits, and the same statistical conformity that produces those habits is what an LLM paraphraser falls back into when asked to disguise itself. The unresolved frontier the corpus hints at isn't paraphrasing the words; it's gaming the feature set directly.

Sources (6 notes)

Can simple linguistic features detect AI-written arguments?

General linguistic features combined with argument-quality measures achieved 99% accuracy detecting LLM-generated counter-arguments on r/ChangeMyView, matching heavyweight neural detectors while remaining computationally cheap and transparent. LLMs produce detectable stylistic signatures: accommodation to prompts and textbook-quality argument markers that humans don't replicate.

Can language models truly understand literary style?

GPT-2 achieves 95% accuracy identifying authorship through style patterns alone, but lacks the evaluative framework to explain why those stylistic choices carry meaning. Detection without interpretation remains cataloguing, not criticism.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges. Two of them, authority bias and beauty bias, are semantics-agnostic and trivially exploitable through fake references and rich formatting, using zero-shot attacks that require no model access or optimization.

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.
