Can human judges detect AI writing through lexical patterns?
While AI text shows measurable differences from human writing across six lexical dimensions, judges—including experts—fail to identify AI authorship reliably. Why does human perception diverge from measurable reality?
The lexical diversity study compared ChatGPT-generated text with human writing across six dimensions (a rough computational sketch follows the list):
- Volume — total word count
- Abundance — richness of vocabulary
- Variety-repetition — ratio of unique to total words
- Evenness — distribution evenness across vocabulary
- Disparity — semantic distance between words used
- Dispersion — spread of vocabulary across text length
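A minimal sketch of how these dimensions might be operationalized, assuming naive whitespace tokenization and simple proxy measures rather than the study's exact instruments:

```python
from collections import Counter
import math

def lexical_profile(text: str) -> dict:
    """Rough, illustrative proxies for the six lexical diversity dimensions."""
    tokens = text.lower().split()          # naive whitespace tokenization
    if not tokens:
        return {}
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)        # token count and type count

    # Evenness: normalized Shannon entropy of the word-frequency distribution
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    evenness = entropy / math.log(v) if v > 1 else 0.0

    # Dispersion: mean type-token ratio across roughly ten equal-length segments,
    # i.e. how the vocabulary spreads over the length of the text
    seg = max(n // 10, 1)
    segments = [tokens[i:i + seg] for i in range(0, n, seg)]
    dispersion = sum(len(set(s)) / len(s) for s in segments) / len(segments)

    return {
        "volume": n,            # total word count
        "abundance": v,         # vocabulary richness (number of distinct words)
        "variety": v / n,       # type-token ratio (unique vs. total words)
        "evenness": evenness,
        "disparity": None,      # semantic distance needs word embeddings; omitted
        "dispersion": dispersion,
    }
```

In practice, disparity would require word embeddings (for example, mean pairwise cosine distance between word vectors), and variety is usually computed with length-corrected measures such as MTLD rather than the raw type-token ratio shown here.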
One-way MANOVAs confirm: LLM text differs significantly from human text on ALL six dimensions. The differences are statistically robust.
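As a hedged illustration of what that test looks like in code, here is a one-way MANOVA via statsmodels; the data below are synthetic placeholders, not the study's measurements:

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Synthetic stand-in data: in practice each row would hold the six dimension
# scores for one text, plus an `author` label (human vs. llm).
rng = np.random.default_rng(0)
dims = ["volume", "abundance", "variety", "evenness", "disparity", "dispersion"]
df = pd.DataFrame(rng.normal(size=(40, 6)), columns=dims)
df["author"] = ["human"] * 20 + ["llm"] * 20

# One-way MANOVA: all six dependent variables against the single factor `author`
fit = MANOVA.from_formula(" + ".join(dims) + " ~ author", data=df)
print(fit.mv_test())  # reports Wilks' lambda, Pillai's trace, etc.
```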
And yet: human judges in multiple studies — including applied linguists and NLP researchers — cannot reliably distinguish AI-generated from human-written text. This is not a new finding, but the combination with specific lexical diversity measurement is new: the differences are real and measurable, but they are the wrong kind for human perception. Human judges are apparently not attending to lexical diversity patterns when making authorship judgments.
This paradox has implications in multiple directions:
- For AI detection: current detection methods may need to move from lexical heuristics to distributional pattern analysis that explicitly targets these six dimensions (see the sketch after this list)
- For AI writing quality: "sounds human" and "is measurably human-like" are different targets; AI writing can satisfy the former while failing the latter
- For academic integrity: the gap between measurable and perceptible means that policy-level responses to AI writing cannot rely on human judgment as the detection mechanism
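On the detection point, one hedged sketch of what a detector built directly on these dimensions could look like: a plain classifier over the six scores rather than surface heuristics. It reuses the lexical_profile() proxy sketched earlier, and `texts` and `labels` are placeholders the reader must supply; nothing here claims this would detect reliably.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def dimension_features(text: str) -> list:
    # Reuses the lexical_profile() proxy from above; the unimplemented
    # disparity dimension is dropped so every feature is numeric.
    p = lexical_profile(text)
    return [p["volume"], p["abundance"], p["variety"], p["evenness"], p["dispersion"]]

# `texts` (list of strings) and `labels` (1 = LLM-generated, 0 = human) are
# assumed inputs; no detection accuracy is claimed here.
X = np.array([dimension_features(t) for t in texts])
y = np.array(labels)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy
```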
Source: Discourses
Related concepts in this collection
- Why do newer AI models diverge further from human writing patterns? As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality? (relation: the trend over model generations)
- Can humans detect AI writing if it looks natural? Despite measurable differences in how AI generates text, human judges—even experts—consistently fail to identify it. This explores why perception lags behind measurement. (relation: writing angle)
- Why do ChatGPT essays lack evaluative depth despite grammatical strength? ChatGPT writes grammatically coherent academic prose but uses fewer evaluative and evidential nouns than student writers. The question explores whether this rhetorical gap—favoring description over argument—reflects a fundamental limitation in how LLMs approach academic writing. (relation: a parallel finding from a different angle, structural differences invisible at the surface but measurable analytically)
- Can we measure reading efficiency as a quality metric? How can we quantify whether generated text delivers novel information efficiently or wastes reader attention through redundancy? This matters because standard coherence and fluency scores miss texts that are well-written but informationally dense. (relation: a complementary metric, lexical diversity tracks vocabulary variety while KD tracks information per token; both quantify measurable deficits that surface evaluation misses)
Original note title
llm text differs measurably from human text on lexical diversity but human judges cannot detect the differences