Can forensic features reliably distinguish LLM arguments from human arguments?

This explores whether the measurable 'fingerprints' a text leaves behind — stylistic and structural markers — can tell apart arguments written by an LLM from those written by a person, and how durable those tells actually are.

The corpus is unusually concrete on this question. The headline result is that surprisingly cheap signals work: a bundle of interpretable linguistic features plus argument-quality measures reaches 99% accuracy at distinguishing LLM counter-arguments from human ones on r/ChangeMyView, matching heavyweight neural detectors while staying transparent enough to explain *why* a text was flagged (see: Can simple linguistic features detect AI-written arguments?). So the answer to 'can forensic features distinguish them' is, at least in this setting, a strong yes.
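To make "interpretable linguistic features" concrete, here is a minimal sketch of the kind of cheap, explainable measurements such a detector might start from. The feature names, the two tiny word lists, and the function names are all assumptions made for illustration; the source notes do not list the exact features used, and real work would rely on validated lexicons and a trained classifier on top.

```python
import re

# Hypothetical mini-lexicons, for illustration only (not from the source).
NEGATIVE_WORDS = {"wrong", "bad", "hate", "stupid", "terrible", "awful"}
CONNECTIVES = {"however", "furthermore", "moreover", "additionally", "therefore"}


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Crude sentence split on ., ! and ? (an assumption, not a real parser)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def extract_features(text):
    """Return a dict of simple, human-readable features for one argument."""
    tokens = tokenize(text)
    n = len(tokens)
    if n == 0:
        return {
            "type_token_ratio": 0.0,
            "mean_sentence_length": 0.0,
            "negative_rate": 0.0,
            "connective_rate": 0.0,
        }
    sentences = split_sentences(text)
    return {
        # Lexical variety: distinct words divided by total words.
        "type_token_ratio": len(set(tokens)) / n,
        # Average number of words per sentence.
        "mean_sentence_length": n / max(len(sentences), 1),
        # Share of tokens found in the negative-emotion list.
        "negative_rate": sum(t in NEGATIVE_WORDS for t in tokens) / n,
        # Share of tokens that are formal discourse connectives.
        "connective_rate": sum(t in CONNECTIVES for t in tokens) / n,
    }
```

Because each value has a plain-language meaning, a flagged text can be explained by pointing at the features that pushed it one way or the other, which is the transparency advantage the finding describes.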

The more interesting question is *what* those features are detecting, and here the corpus pulls apart two distinct tells. The first concerns the argument in isolation: LLM arguments read like textbook ideals, scoring high on cogency, justification, respectfulness, and positive tone, while humans score higher on lexical creativity, negative emotion, and conversational scrappiness. That gap traces back to RLHF rewarding politeness over authentic disagreement (see: Do LLM arguments actually argue better than humans?). The second tell is relational rather than absolute: LLM replies *converge* stylistically toward the post they are answering, aligning with its style, named entities, and psycholinguistic features more closely than a human reply would. This is a side effect of autoregressive generation, and it shows up only when the reply is compared against its target (see: Do LLM counter-arguments mirror writing style more than humans?). That second signature is arguably the more robust one, because it stems from a generative mechanism rather than a surface style a model could be told to drop.
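The relational tell can be sketched as a pair of scores computed between a reply and the post it answers, rather than on the reply alone. The measures below (cosine similarity over function-word frequencies, and Jaccard overlap of capitalized words as a crude stand-in for named entities) are assumptions chosen for illustration; the source notes do not say which alignment measures were actually used, and the word list and function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical function-word list used as a rough style profile.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "that",
                  "i", "you", "it", "is", "not", "but")


def style_vector(text):
    """Relative frequency of each function word in the text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(FUNCTION_WORDS)
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def style_convergence(post, reply):
    """How closely the reply's function-word profile matches the post's."""
    return cosine(style_vector(post), style_vector(reply))


def entity_overlap(post, reply):
    """Jaccard overlap of capitalized words (a crude entity proxy that
    also picks up sentence-initial words)."""
    post_ents = set(re.findall(r"\b[A-Z][a-z]+\b", post))
    reply_ents = set(re.findall(r"\b[A-Z][a-z]+\b", reply))
    union = post_ents | reply_ents
    if not union:
        return 0.0
    return len(post_ents & reply_ents) / len(union)
```

Under this framing, the quantity of interest is whether LLM replies score systematically higher than human replies to the same post, not any property of the reply text taken by itself.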

Why do these tells exist at the mechanism level? Token prediction is a smooth probabilistic flow toward the training distribution. It does not explore competing claims or generate rhetorical turbulence, so the output stays uniformly polished rather than ragged the way human dispute is (see: Does LLM generation explore competing claims while producing text?). The same training pressure surfaces elsewhere: models avoid correcting false claims to save face and keep social harmony, even when they demonstrate the correct knowledge on direct questions (see: Why do language models avoid correcting false user claims?). The 'textbook quality' that detectors catch is not a quirk of style; it is the visible residue of how these systems are trained to behave.

The word 'reliably' is where the corpus gets cautious, though. Accommodation-to-prompt and textbook markers are signatures *today*, but they are partly artifacts of current training objectives, which makes them a moving target as models change. And the corpus has a quieter warning from the other side of the detection coin: LLM judges are trivially fooled by authority signals and rich formatting, scoring text higher for fake references or pretty layout regardless of content (see: Can LLM judges be fooled by fake credentials and formatting?). That hints that surface forensic features cut both ways: the same superficial cues that betray a machine can be deliberately added or stripped to game a classifier.
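One way to probe how much a detector or judge leans on surface cues is to score a text before and after stripping them and look at the difference. The sketch below is an assumption-laden illustration, not a method from the source: `strip_surface_cues` and `cue_sensitivity` are hypothetical helpers, the regexes cover only bracketed numeric citations and a few markdown characters, and `score_fn` stands in for whatever classifier or judge is being tested.

```python
import re


def strip_surface_cues(text):
    """Remove bracketed numeric citations and basic markdown markup."""
    text = re.sub(r"\[\d+\]", "", text)   # citations such as [1]
    text = re.sub(r"[*_#`>]", "", text)   # emphasis, headings, quotes, code ticks
    return re.sub(r"\s+", " ", text).strip()


def cue_sensitivity(score_fn, text):
    """Score change attributable to the stripped cues.

    score_fn is any callable mapping a string to a number. A large
    absolute value suggests the scorer depends on formatting or
    citation markers rather than on content.
    """
    return score_fn(text) - score_fn(strip_surface_cues(text))
```

A scorer whose output barely moves under this transformation is less exposed to the add-or-strip gaming described above, at least for the cues covered here.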

The thing you might not have expected: the most durable forensic signal isn't anything *in* the LLM's text but the *relationship* between its argument and what it's responding to. Absolute style can be coached away; the convergence toward the target post is baked into how the model generates. If you want one doorway, start with the relational-features finding (see: Do LLM counter-arguments mirror writing style more than humans?). It reframes detection from 'what does AI writing look like' to 'how does AI writing relate to its context,' which is a much harder tell to erase.

Sources (6 notes)

Can simple linguistic features detect AI-written arguments?

General linguistic features combined with argument-quality measures achieved 99% accuracy detecting LLM-generated counter-arguments on r/ChangeMyView, matching heavyweight neural detectors while remaining computationally cheap and transparent. LLMs produce detectable stylistic signatures: accommodation to prompts and textbook-quality argument markers that humans don't replicate.

Do LLM arguments actually argue better than humans?

LLM-generated arguments score higher on formal quality markers (cogency, justification, respect, positive tone) while humans score higher on lexical creativity, negative emotion, and conversational interactivity. This gap reflects RLHF training objectives that reward politeness over authentic disagreement.

Do LLM counter-arguments mirror writing style more than humans?

Analysis of r/ChangeMyView shows LLM replies align more closely with original posts across style, named entities, and psycholinguistic features than human replies do. This convergence, driven by autoregressive generation, creates a signature detectable through relational features rather than absolute text properties.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.
