Why do newer AI models diverge further from human writing patterns?
As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality?
The lexical diversity study compared ChatGPT-3.5, 4, o4-mini, and 4.5. The key finding: the newest models, o4-mini and 4.5, differ most from human-written text on lexical diversity measures. By measurable metrics, they are the least human-like.
At the same time, human judges consistently fail to detect AI-generated text regardless of model version. More capable models don't become easier to detect; the failure of human judgment is stable across model generations.
ChatGPT-4.5 produces higher lexical diversity than older models despite generating fewer tokens — it is more lexically dense, but the density pattern is still non-human. The implication: newer models aren't converging on human-like writing by becoming better at mimicking human lexical patterns; they are becoming better at generating high-quality text that is nonetheless systematically different from human text.
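The "more diversity from fewer tokens" claim is easiest to see with the simplest lexical diversity measure, the type-token ratio (the study's exact metrics aren't named here, so TTR serves as an illustrative proxy; the two text snippets below are hypothetical):

```python
def type_token_ratio(text: str) -> float:
    """Ratio of unique words (types) to total words (tokens)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

# Hypothetical snippets: the longer one repeats itself, so despite
# producing more tokens it has LOWER lexical diversity.
longer = "the model writes and writes and the model repeats the same words again and again"
shorter = "concise output avoids repeating vocabulary entirely"

print(type_token_ratio(longer))   # 8 unique / 15 total ≈ 0.53
print(type_token_ratio(shorter))  # 6 unique / 6 total = 1.0
```

A shorter text with little repetition scores higher than a longer, repetitive one, which is the sense in which 4.5 can be "more lexically dense" while emitting fewer tokens.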
This suggests that the training objective (RLHF and human quality preferences) pushes models toward an optimum different from "human-like lexical diversity": one that human raters score as higher quality, yet is measurably more distinct from how humans naturally write.
The widening gap between what is measurable and what is perceptible has an important practical consequence: as models improve, naive human-based detection becomes less viable, not more. Reliable detection requires statistical and computational analysis that humans do not perform spontaneously.
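A minimal sketch of what such computational analysis might look like, using the moving-average type-token ratio (MATTR), a standard length-robust lexical diversity measure. The baseline and tolerance values are illustrative assumptions, not figures from the study:

```python
def mattr(tokens: list[str], window: int = 10) -> float:
    """Moving-average type-token ratio: mean TTR over sliding windows,
    which is less sensitive to overall text length than raw TTR."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

def flag_if_divergent(sample: str, human_baseline: float, tol: float = 0.08) -> bool:
    """Flag a sample whose windowed lexical diversity departs from a
    human baseline by more than `tol` (both values are hypothetical)."""
    return abs(mattr(sample.lower().split()) - human_baseline) > tol
```

The point is not this particular threshold test but the shift it represents: detection moves from holistic human judgment to explicit measurement against a reference distribution of human writing.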
Source: Discourses