Has the Creativity of Large Language Models Peaked? An Analysis of Inter- and Intra-LLM Variability
Recent studies document a homogenizing effect of LLM assistance across tasks ranging from creative writing and survey responses to research idea generation (Doshi and Hauser, 2024; Anderson et al., 2024; Moon et al., 2024). For instance, stories written with ChatGPT assistance were more uniform than those written independently by humans (Doshi and Hauser, 2024). Similarly, LLM-authored college essays contained fewer novel ideas than those written without LLM assistance (Moon et al., 2024). While these studies raise concerns about the creativity of LLMs, they typically focus on a single LLM, leaving open the question of whether this homogeneity is unique to specific models or a broader phenomenon across different LLMs. Further, it is unclear whether LLMs have become more creative since 2023, when they first gained widespread attention.
LLMs contribute to this development by offering “beyond-human” information-processing capacities. Their ability to combine vast knowledge with stylistic variation enables outputs that are perceived as original and inventive (Sinha et al., 2023; Soroush et al., 2025). When used effectively, LLMs support both fluency and originality in human output, allowing individuals to move more fluidly through ideation and elaboration phases (Doshi and Hauser, 2024; Heyman et al., 2024). They can also contribute in ways that exceed human capacity for complexity, suggesting non-intuitive solutions, as evidenced by examples such as plane coloring in mathematics (Mundinger et al., 2024), protein structure prediction (Varadi and Velankar, 2023), and autonomous driving (Atakishiyev et al., 2024).
However, these benefits are accompanied by notable limitations. Longitudinal research indicates that the output of GenAI systems tends toward aesthetic homogeneity, with content appearing increasingly similar across instances and users (Zhou and Lee, 2024). Moreover, AI-generated content often lacks the element of human surprise and subjectivity, offering elaborated yet convergent ideas that are well-structured but less likely to deviate from known paths (Song et al., 2025; Doshi and Hauser, 2024). In this regard, GenAI systems tend to reinforce structured problem-solving rather than facilitate radical or conceptual creativity. They excel at combinatorial creativity, that is, remixing and synthesizing existing knowledge, but still struggle to produce conceptual leaps or paradigm-shifting insights (Soroush et al., 2025; Orrù et al., 2023). Accordingly, human input remains indispensable, not only for refining and evaluating the ideas produced but also for grounding them in contextually meaningful ways (Lazar et al., 2022; Runco, 2023).
The most promising outcomes arise when humans and AI engage in mutual exploration. In such settings, the human creator defines the problem space and direction, the AI contributes associative or alternative paths, and both iterate toward refined and context-sensitive creative solutions (Haase and Pokutta, 2024). This dynamic reflects well-established principles of team-based human creativity (Paulus et al., 2012).
Regarding the first research question, we found that GPT-4o (previously benchmarked in 2023 as GPT-4) performed substantially worse on the Divergent Association Task (DAT) but retained its performance on the Alternative Uses Task (AUT). Even for the AUT, however, only 0.28% of responses reached the 90th percentile of the human distribution. In other words, highly creative responses remain rare: since 10% of human responses lie above the human 90th percentile by definition, humans are approximately 35.7 times (10/0.28) more likely to produce such standout ideas. This finding offers one possible explanation for the increasingly documented trend toward homogenization in LLM-assisted output (Doshi and Hauser, 2024; Moon et al., 2024; Anderson et al., 2024). While LLMs may generate text that appears individually novel, they often lack the type of originality required to break into the top decile of human creativity.
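For concreteness, the DAT is commonly scored as the mean pairwise cosine distance between embeddings of the nominated words, scaled by 100, so that semantically more distant word sets score higher. The minimal sketch below illustrates that computation; the toy random vectors merely stand in for real word embeddings (e.g., GloVe), and the helper names are ours, not those of any particular scoring implementation.

```python
import itertools
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity between two embedding vectors.
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dat_score(words, embeddings):
    # Mean pairwise cosine distance over all word pairs, scaled by 100.
    vectors = [embeddings[w] for w in words]
    distances = [cosine_distance(u, v)
                 for u, v in itertools.combinations(vectors, 2)]
    return 100.0 * float(np.mean(distances))

# Toy embeddings (random, fixed seed) standing in for real word vectors;
# with real embeddings, unrelated words yield larger pairwise distances
# and thus higher scores.
rng = np.random.default_rng(0)
words = ["arm", "blanket", "cactus", "dream", "engine", "fog", "glacier"]
embeddings = {w: rng.normal(size=50) for w in words}
print(f"DAT-style score: {dat_score(words, embeddings):.1f}")
```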
Prompt design emerged as another significant modulator of performance. We found that merely disclosing the creative test context (e.g., mentioning the DAT) influenced LLM performance in model-specific ways, improving results for some models (e.g., Claude 3.5 and Grok 2) while worsening performance for DeepSeek R1 Distill Qwen 7B. This aligns with recent findings showing that LLMs are sensitive to goal framing and task specification (Memmert et al., 2024a), and echoes human creativity research on priming effects, where some individuals perform better when prompted to “be creative” (Sassenberg and Moskowitz, 2005; Acar et al., 2020). The implication is that creativity in LLMs may, in part, be prompt-contingent, a result of interaction dynamics rather than an inherent capacity.
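To illustrate the kind of framing manipulation at issue, the sketch below contrasts two hypothetical prompt conditions: one stating the task neutrally and one disclosing the creative-test context. The prompt wording and the `query_model` helper are illustrative assumptions, not the exact materials of our study; sampling several completions per condition allows score distributions, rather than single responses, to be compared across framings.

```python
from typing import Callable, Dict, List

# Neutral task instruction, loosely following the DAT's "unrelated nouns"
# format (an illustrative paraphrase, not our exact study materials).
TASK = ("Name 10 nouns that are as different from each other as possible, "
        "in all meanings and uses of the words.")

# Two framings: identical task, with vs. without disclosing the test context.
CONDITIONS = {
    "undisclosed": TASK,
    "disclosed": ("You are taking the Divergent Association Task (DAT), "
                  "a test of verbal creativity. " + TASK),
}

def collect_responses(query_model: Callable[[str], str],
                      n: int = 20) -> Dict[str, List[str]]:
    # query_model is a placeholder for whatever LLM API is in use;
    # it maps a prompt string to a completion string.
    return {name: [query_model(prompt) for _ in range(n)]
            for name, prompt in CONDITIONS.items()}

# Usage with a stub model, just to show the shape of the output:
stub = lambda prompt: ("arm, blanket, cactus, dream, engine, fog, "
                       "glacier, hat, ink, jar")
responses = collect_responses(stub, n=2)
print(responses["disclosed"][0])
```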
Our findings also speak to the larger philosophical debate on artificial creativity. Critics argue that LLMs merely remix existing data, lacking the emotional depth, intentionality, or conceptual leaps characteristic of human creativity (Runco, 2023; Cropley and Cropley, 2023). Indeed, the absence of high-end originality in LLM output could be taken as support for this view. However, we caution against such binary thinking. While LLMs may not engage in creative processes in the human sense, their ability to generate outputs that score above the average human in both semantic divergence and usefulness indicates a form of functional or output-based creativity. Beyond the critical points discussed so far, when specifically picking a “creative LLM”, 80% of its AUT output is on average better than that of humans.
Still, as our findings show, such systems may encourage mid-level novelty but rarely produce radically original ideas, thus reinforcing combinatorial rather than conceptual creativity (Soroush et al., 2025; Orrù et al., 2023). Without thoughtful human oversight and critical engagement, GenAI may unintentionally constrain creative diversity and reinforce existing patterns rather than expand the overall human-AI creative process.