Can lightweight linguistic features reliably detect LLM-generated arguments?
This explores whether cheap, transparent linguistic signals — not heavyweight neural detectors — can spot AI-written arguments, and why LLM prose leaves a detectable fingerprint in the first place.
The corpus has a direct, striking answer: yes. On Reddit's r/ChangeMyView, a bundle of general linguistic features plus argument-quality measures reached 99% accuracy at detecting LLM-generated counter-arguments, matching expensive neural detectors while staying computationally cheap and human-readable (see: Can simple linguistic features detect AI-written arguments?). The tell isn't subtle errors; it's the opposite. LLMs over-accommodate the prompt and produce "textbook-quality" argument markers that real people don't bother to replicate. The machine is too clean.
That cleanliness is worth dwelling on, because the corpus suggests it's structural, not accidental. Token generation is described as a smooth probabilistic flow that continues toward the training distribution rather than wrestling with competing claims, so the model multiplies tidy, on-distribution statements instead of generating the friction a real arguer shows when weighing counterpositions (see: Does LLM generation explore competing claims while producing text?). There's also no fixed author behind the text: regenerate the same prompt and you get different outputs, each internally consistent, because the model samples a character rather than committing to one (see: Do large language models actually commit to a single character?). The detectable signature, in other words, is the residue of a process that smooths and samples rather than reasons and commits.
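The "samples rather than commits" point can be made concrete with a toy sketch. The three candidate tokens and their probabilities below are invented for illustration and do not come from any real model; the only point is that repeated draws from one fixed distribution give different continuations, each a valid draw.

```python
import random
from typing import List

# Hypothetical next-token distribution for some prompt.
# Tokens and probabilities are made up for this sketch.
NEXT_TOKEN_PROBS = {"however": 0.40, "moreover": 0.35, "granted": 0.25}


def sample_continuation(rng: random.Random) -> str:
    """Draw one token according to the fixed distribution."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


def regenerate(n: int, seed: int) -> List[str]:
    """Simulate regenerating the same prompt n times."""
    rng = random.Random(seed)
    return [sample_continuation(rng) for _ in range(n)]


if __name__ == "__main__":
    # Each run of the loop is one "regeneration" of the same prompt.
    print(regenerate(5, seed=0))
```

Nothing in the distribution changes between draws; the variation comes entirely from sampling, which is the sense in which there is no single committed author.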
Here's the twist a curious reader might not expect: LLMs are good at *producing* arguments but shaky at *analyzing* them. Their performance at classifying argumentation schemes is marginal: even with few-shot examples plus scheme descriptions, only the larger models exceed an F1 of 0.55, with Claude reaching about 0.65 (see: Can large language models classify argument schemes reliably?). So the very polish that makes LLM arguments easy for a lightweight classifier to flag is not matched by the model's own grasp of argument structure. Generation outruns comprehension.
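For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score for scheme classification could be computed. The notes do not say whether the reported figures are macro- or micro-averaged, so the macro average below is an assumption, and the scheme labels are purely illustrative.

```python
from typing import List


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: List[str], predicted: List[str]) -> float:
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative labels only; not data from the cited work.
    gold = ["expert_opinion", "expert_opinion", "analogy", "analogy"]
    pred = ["expert_opinion", "analogy", "analogy", "analogy"]
    print(round(macro_f1(gold, pred), 3))
```

Because F1 balances precision against recall, a score near 0.55 means a model is getting a large share of schemes wrong in one direction or the other, not merely missing rare cases.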
The reason lean linguistic features work at all connects to a deeper pattern in the corpus: LLM behavior has systematic, *predictable* surface regularities. Models stumble in characteristic ways on syntactic complexity (see: Why do large language models fail at complex linguistic tasks?), and their failures are forecastable once you treat them as autoregressive probability machines (see: Can we predict where language models will fail?). Predictable surface behavior is exactly what cheap, interpretable features can exploit: you don't need a black box to catch a pattern that's regular by construction.
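To give a sense of what "cheap, interpretable features" can mean in practice, here is a small sketch. The three features and the marker list are assumptions chosen for illustration; they are not the feature set used in the r/ChangeMyView study, and the function name `extract_features` is hypothetical.

```python
import re
from typing import Dict

# Illustrative list of explicit argument markers (an assumption,
# not the study's actual lexicon).
DISCOURSE_MARKERS = ("however", "therefore", "moreover",
                     "furthermore", "in conclusion")


def extract_features(text: str) -> Dict[str, float]:
    """Compute three transparent surface features of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {"avg_sentence_length": 0.0,
                "type_token_ratio": 0.0,
                "marker_rate": 0.0}
    normalized = " ".join(words)
    marker_hits = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", normalized))
        for marker in DISCOURSE_MARKERS
    )
    return {
        # Mean number of words per sentence.
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        # Lexical diversity: distinct words over total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of words that are explicit argument markers.
        "marker_rate": marker_hits / len(words),
    }


if __name__ == "__main__":
    print(extract_features("However, this is true. Therefore, it holds."))
```

Each value is a plain number a person can inspect, so a simple classifier trained on such features stays explainable in a way a neural detector does not.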
The honest caveat the corpus implies: the 99% result comes from one domain (counter-arguments on one forum), and the signal is partly stylistic over-quality, the kind of thing that could erode as models are tuned to sound more human, or shift across genres. But for now the answer leans firmly yes, and the more interesting takeaway is *why*: detection works because LLM argumentation is fluent without being effortful, and that effortlessness is itself the giveaway. If you want to push further, the structured-prompting work on forcing models to check warrants hints at what "effortful" machine argument might eventually look like (see: Can structured argument prompts make LLM reasoning more rigorous?).
Sources (7 notes)
General linguistic features combined with argument-quality measures achieved 99% accuracy detecting LLM-generated counter-arguments on r/ChangeMyView, matching heavyweight neural detectors while remaining computationally cheap and transparent. LLMs produce detectable stylistic signatures: accommodation to prompts and textbook-quality argument markers that humans don't replicate.
Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that no fixed commitment exists.
Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.