Can simple linguistic features detect AI-written arguments?
Can interpretable linguistic patterns reliably distinguish LLM-generated counter-arguments from human-written ones in persuasive contexts? This matters because simple, auditable detection might match expensive neural approaches at a fraction of the cost.
A combination of general-purpose linguistic features (lexical richness, syntactic complexity, type-token ratios) and argument-quality features (logical soundness, justification, engagement strategy) detects LLM-generated counter-arguments on r/ChangeMyView with nearly 99% accuracy. The features are interpretable — they name what they detect — and the detector is computationally cheap. External benchmark tests show this lightweight method performs comparably to heavyweight neural detectors in generalized detection scenarios.
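To make the "general-purpose linguistic features" concrete, here is a minimal sketch of how three such features could be computed with only the standard library. The function names, the regex tokenizer, and the choice of these three features are assumptions for illustration; the note does not specify the study's actual feature set or tooling, and the argument-quality features (logical soundness, justification, engagement strategy) are not covered here.

```python
import re


def tokenize(text):
    """Lowercase word tokens from a deliberately simple regex tokenizer (an assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive sentence split: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_features(text):
    """Return a small dict of named, human-readable features for one text."""
    tokens = tokenize(text)
    n_tokens = len(tokens)
    if n_tokens == 0:
        # Empty or non-alphabetic input: report zeros instead of dividing by zero.
        return {
            "type_token_ratio": 0.0,
            "mean_sentence_length": 0.0,
            "mean_word_length": 0.0,
        }
    sentences = split_sentences(text)
    return {
        # Lexical richness: distinct words divided by total words.
        "type_token_ratio": len(set(tokens)) / n_tokens,
        # Rough proxy for syntactic complexity: words per sentence.
        "mean_sentence_length": n_tokens / len(sentences),
        # Rough proxy for vocabulary sophistication: characters per word.
        "mean_word_length": sum(len(t) for t in tokens) / n_tokens,
    }
```

Each key in the returned dict names what it measures, which is the property the note calls interpretability. Note that type-token ratio is sensitive to text length, so a real pipeline would likely need to normalize for it.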
The methodological point matters more than the accuracy number. Detection research has trended toward black-box classifiers: fine-tuned transformers that produce a yes/no without an explanation. The CMV result is the inverse: pick the right interpretable features and you get equivalent performance for a fraction of the compute, with the audit trail built in. The features do the work; the classifier is a wrapper.
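One way to picture "the audit trail built in" is a linear scorer whose output decomposes into per-feature contributions. The sketch below assumes a logistic-regression-style model; the note does not say which classifier the study used, and the weights and bias here are invented purely for illustration and carry no empirical meaning.

```python
import math

# Hypothetical weights and bias, made up for this example only.
# They are NOT taken from the study and say nothing about real LLM text.
WEIGHTS = {"type_token_ratio": -2.0, "mean_sentence_length": 0.1}
BIAS = 0.0


def explain(features, weights=WEIGHTS, bias=BIAS):
    """Score one text and return (probability, per-feature contributions).

    Each contribution is weight * feature value, so the decision can be
    traced back to named features rather than to opaque activations.
    """
    contributions = {name: weights[name] * features[name] for name in weights}
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions
```

Calling `explain({"type_token_ratio": 0.5, "mean_sentence_length": 10.0})` returns the score together with a dict showing how much each feature pushed it up or down, which is the kind of explanation a fine-tuned transformer does not provide by default.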
The detection holds for one specific context, persuasive counter-arguments on CMV, and the authors are careful to flag the open questions: how prompt design affects detectability, how task type interacts with the feature signature, and how these features behave under adversarial paraphrase. The 99% number is a ceiling for a specific genre, not a universal claim about LLM detection.
The forensic implication is the durable part. As long as LLM production mechanisms differ structurally from human production (stylistic mirroring of prompts, higher emotional positivity, textbook-quality argument markers), interpretable feature-based detection will find a target. Robust evasion would require LLMs to produce text whose features are human-like, not merely text whose content is convincing. That is a much harder optimization problem than the one current LLM training targets.
Related concepts in this collection
- Do LLM counter-arguments mirror writing style more than humans?
  When language models generate arguments against social media posts, do they unconsciously adopt the stylistic features of what they're arguing against? This matters because it could reveal a detectable pattern that distinguishes LLM-written rebuttals from human-written ones.
  one of the discriminating features
- Do LLM arguments actually argue better than humans?
  LLM counter-arguments score higher on textbook quality markers like logical soundness and respectful tone, while human arguments show more creativity and emotional intensity. What does this gap reveal about how we measure argumentative quality?
  the other discriminating axis
- Do LLMs and humans persuade through the same mechanisms?
  If LLM and human arguments achieve equal persuasive force, does that mean they work the same way? This explores whether equivalent outcomes hide fundamentally different rhetorical strategies.
  explains why interpretable features work: equivalent persuasion arises from different production processes
Original note title
lightweight interpretable linguistic features achieve 99 percent accuracy detecting LLM-generated counter-arguments in persuasive discourse