Can human judges detect AI writing through lexical patterns?
While AI text shows measurable differences from human writing across six lexical dimensions, judges—including experts—fail to identify AI authorship reliably. Why does human perception diverge from measurable reality?
The lexical diversity study compared ChatGPT-generated text with human writing across six dimensions (a rough computational sketch follows the list):
- Volume — total word count
- Abundance — richness of vocabulary
- Variety-repetition — ratio of unique to total words
- Evenness — distribution evenness across vocabulary
- Disparity — semantic distance between words used
- Dispersion — spread of vocabulary across text length
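A minimal sketch of how these dimensions might be operationalized, assuming naive whitespace tokenization and simple proxy measures rather than the study's exact instruments:

```python
from collections import Counter
import math

def lexical_profile(text: str) -> dict:
    """Rough, illustrative proxies for the six lexical diversity dimensions."""
    tokens = text.lower().split()          # naive whitespace tokenization
    if not tokens:
        return {}
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)        # token count and type count

    # Evenness: normalized Shannon entropy of the word-frequency distribution
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    evenness = entropy / math.log(v) if v > 1 else 0.0

    # Dispersion: mean type-token ratio across roughly ten equal-length segments,
    # i.e. how the vocabulary spreads over the length of the text
    seg = max(n // 10, 1)
    segments = [tokens[i:i + seg] for i in range(0, n, seg)]
    dispersion = sum(len(set(s)) / len(s) for s in segments) / len(segments)

    return {
        "volume": n,            # total word count
        "abundance": v,         # vocabulary richness (number of distinct words)
        "variety": v / n,       # type-token ratio (unique vs. total words)
        "evenness": evenness,
        "disparity": None,      # semantic distance needs word embeddings; omitted
        "dispersion": dispersion,
    }
```

In practice, disparity would require word embeddings (for example, mean pairwise cosine distance between word vectors), and variety is usually computed with length-corrected measures such as MTLD rather than the raw type-token ratio shown here.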
One-way MANOVAs confirm: LLM text differs significantly from human text on ALL six dimensions. The differences are statistically robust.
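As a hedged illustration of what that test looks like in code, here is a one-way MANOVA via statsmodels; the data below are synthetic placeholders, not the study's measurements:

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Synthetic stand-in data: in practice each row would hold the six dimension
# scores for one text, plus an `author` label (human vs. llm).
rng = np.random.default_rng(0)
dims = ["volume", "abundance", "variety", "evenness", "disparity", "dispersion"]
df = pd.DataFrame(rng.normal(size=(40, 6)), columns=dims)
df["author"] = ["human"] * 20 + ["llm"] * 20

# One-way MANOVA: all six dependent variables against the single factor `author`
fit = MANOVA.from_formula(" + ".join(dims) + " ~ author", data=df)
print(fit.mv_test())  # reports Wilks' lambda, Pillai's trace, etc.
```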
And yet: human judges in multiple studies — including applied linguists and NLP researchers — cannot reliably distinguish AI-generated from human-written text. This is not a new finding, but the combination with specific lexical diversity measurement is new: the differences are real and measurable, but they are the wrong kind for human perception. Human judges are apparently not attending to lexical diversity patterns when making authorship judgments.
This paradox has implications in multiple directions:
- For AI detection: current detection methods may need to move from lexical heuristics to distributional pattern analysis that explicitly targets these six dimensions (see the sketch after this list)
- For AI writing quality: "sounds human" and "is measurably human-like" are different targets; AI writing can satisfy the former while failing the latter
- For academic integrity: the gap between measurable and perceptible means that policy-level responses to AI writing cannot rely on human judgment as the detection mechanism
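On the detection point, one hedged sketch of what a detector built directly on these dimensions could look like: a plain classifier over the six scores rather than surface heuristics. It reuses the lexical_profile() proxy sketched earlier, and `texts` and `labels` are placeholders the reader must supply; nothing here claims this would detect reliably.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def dimension_features(text: str) -> list:
    # Reuses the lexical_profile() proxy from above; the unimplemented
    # disparity dimension is dropped so every feature is numeric.
    p = lexical_profile(text)
    return [p["volume"], p["abundance"], p["variety"], p["evenness"], p["dispersion"]]

# `texts` (list of strings) and `labels` (1 = LLM-generated, 0 = human) are
# assumed inputs; no detection accuracy is claimed here.
X = np.array([dimension_features(t) for t in texts])
y = np.array(labels)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy
```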
Source: Discourses
Related concepts in this collection
- Why do newer AI models diverge further from human writing patterns? As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality? (relation: the trend over model generations)
- Can humans detect AI writing if it looks natural? Despite measurable differences in how AI generates text, human judges—even experts—consistently fail to identify it. This explores why perception lags behind measurement. (relation: writing angle)
- Why do ChatGPT essays lack evaluative depth despite grammatical strength? ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap—favoring description over argument—reflects a fundamental limitation in how LLMs approach academic writing. (relation: a parallel finding from a different angle, structural differences invisible at the surface but measurable analytically)
- Can we measure reading efficiency as a quality metric? How can we quantify whether generated text delivers novel information efficiently or wastes reader attention through redundancy? This matters because standard coherence and fluency scores miss texts that are well-written but informationally dense. (relation: a complementary metric, lexical diversity tracks vocabulary variety while KD tracks information per token; both quantify measurable deficits that surface evaluation misses)
Original note title
llm text differs measurably from human text on lexical diversity but human judges cannot detect the differences