Do LLMs produce texts with "human-like" lexical diversity?

Paper · arXiv 2508.00086 · Published July 31, 2025
Linguistics, NLP, NLU

The degree to which LLMs produce writing that is truly human-like remains unclear despite the extensive empirical attention this question has received. The present study addresses this question from the perspective of lexical diversity. Specifically, the study investigates patterns of lexical diversity in LLM-generated texts from four ChatGPT models (3.5, 4, o4 mini, and 4.5) in comparison with texts written by L1 and L2 English participants (n = 240) across four education levels. Six dimensions of lexical diversity were measured in each text: volume, abundance, variety-repetition, evenness, disparity, and dispersion. Results from one-way MANOVAs, one-way ANOVAs, and Support Vector Machines revealed that the LLM-generated texts differed significantly from human-written texts on each variable, with ChatGPT-o4 mini and -4.5 differing the most. Of these two newer models, ChatGPT-4.5 demonstrated higher levels of lexical diversity than the older models despite producing fewer tokens. The human writers' lexical diversity did not differ across subgroups (i.e., education, language status). Altogether, the results indicate that LLMs do not produce human-like texts in terms of lexical diversity, and that newer LLMs produce less human-like texts than older models. We discuss the implications of these results for language pedagogy and related applications.
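The six dimensions named in the abstract are specialized indices, and the study's exact operationalizations are not given here. As a rough, illustrative sketch only, two of the simplest measures that underlie this family of metrics (token volume and the type-token ratio, a basic variety-repetition index) can be computed as follows; the function names and sample sentence are hypothetical, not from the study:

```python
# Illustrative sketch of two basic lexical-diversity measures.
# These are NOT the study's actual metrics, which use six more
# sophisticated dimensions (volume, abundance, variety-repetition,
# evenness, disparity, dispersion).

def tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer, lowercased for type counting."""
    return text.lower().split()

def volume(tokens: list[str]) -> int:
    """Total number of tokens (a crude 'volume' measure)."""
    return len(tokens)

def type_token_ratio(tokens: list[str]) -> float:
    """Unique types divided by total tokens; higher = less repetition.
    Note: raw TTR is sensitive to text length, which is why research
    instruments use length-corrected indices instead."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "the cat sat on the mat and the dog sat too"
toks = tokenize(sample)
print(volume(toks))                       # 11 tokens
print(round(type_token_ratio(toks), 3))   # 8 types / 11 tokens ≈ 0.727
```

Raw TTR shrinks as texts grow longer, which is one reason the study's comparison of models that produce different token counts requires length-robust measures.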

Human judges' ability to differentiate between human- and GenAI-produced writing

Researchers have investigated whether humans can successfully discern whether a given text was written by a human or a machine. Köbis and Mossink (2021) asked human judges to distinguish between poetry written by humans and by an early GPT model (GPT-2); the participants were generally unable to do so. Casal and Kessler (2023) examined whether applied linguists could reliably identify whether abstracts for four different studies were human- or GenAI-written. Again, even highly trained applied linguists were not successful in discerning authorship. Yeadon et al. (2024) asked participants to judge whether short-essay responses to physics questions were GenAI- or human-written. Here, too, participants were unsuccessful in correctly identifying GenAI-generated texts. More recently, Wen and Laporte (2025) asked human raters to distinguish between experiential narratives written by humans and by two GenAI models (ChatGPT-3.5 and -4.0). Though ChatGPT-3.5 and -4.0 produced texts that were, respectively, less and more lexically diverse than human-written texts, the participants were still unable to accurately distinguish between the GenAI- and human-written narratives.