Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

Paper · arXiv 2510.22954 · Published October 27, 2025
Foundation Models · Evaluations · Alignment

Large language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. To address this gap, we introduce INFINITY-CHAT, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., creative content generation, brainstorm & ideation) that further break down into 17 subcategories. Using INFINITY-CHAT, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation of LMs, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and, more strikingly, (2) inter-model homogeneity, where different models produce strikingly similar outputs. INFINITY-CHAT also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that state-of-the-art LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, INFINITY-CHAT presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research for mitigating long-term AI safety risks posed by the Artificial Hivemind.

We introduce INFINITY-CHAT, a large-scale dataset of 26K real-world open-ended queries spanning diverse, naturally occurring prompts mined from WildChat [94]. These queries admit a wide range of plausible answers with no single correct response. We further develop the first comprehensive taxonomy of open-ended LM queries, encompassing 6 top-level categories (e.g., Brainstorm & Ideation, and less explored types such as Speculative & Hypothetical Scenarios, and Skill Development) and 17 subcategories grounded in natural chatbot-user interactions.

Using INFINITY-CHAT, we systematically study intra- and inter-model mode collapse across 70+ open- and closed-source LMs (25 detailed in the main paper). We uncover a pronounced Artificial Hivemind effect: (1) intra-model repetition, where a single model repeatedly generates similar outputs, and, more critically, (2) inter-model homogeneity, where different models independently converge on similar ideas with minor variations in phrasing. The latter finding warns that model ensembles may not yield true diversity when their constituents share overlapping alignment and training priors.
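The two facets of the Artificial Hivemind can be operationalized with pairwise response similarity: average similarity among repeated samples from one model (intra-model repetition) versus average similarity across different models' responses to the same prompt (inter-model homogeneity). The sketch below uses word-bigram Jaccard overlap as a simple lexical proxy; the paper's actual metrics may differ, and all function names here are illustrative.

```python
from itertools import combinations

def bigrams(text):
    """Set of word bigrams in a lowercased response."""
    toks = text.lower().split()
    return {(toks[i], toks[i + 1]) for i in range(len(toks) - 1)}

def jaccard(a, b):
    """Jaccard overlap between two bigram sets (0 = disjoint, 1 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def intra_model_repetition(samples):
    """Mean pairwise similarity among repeated samples from one model."""
    pairs = list(combinations([bigrams(s) for s in samples], 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)

def inter_model_homogeneity(responses_by_model):
    """Mean pairwise similarity across different models' responses
    to the same prompt (one response per model)."""
    sets = [bigrams(r) for r in responses_by_model.values()]
    pairs = list(combinations(sets, 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)
```

In practice, a semantic similarity measure (e.g., embedding cosine similarity) would also capture the paraphrase-level convergence described above, which lexical overlap understates.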

Beyond generative behaviors, we also examine whether LMs are calibrated to assess alternative responses of comparable quality to open-ended queries. To enable this study, we collect 31,250 human annotations on distinct model responses in INFINITY-CHAT, encompassing both absolute quality ratings and pairwise preferences, with dense annotations from 25 independent annotators per query–response pair. Our results show that LMs, reward models, and LM-based judges are often miscalibrated with respect to human ratings on responses that elicit divergent, idiosyncratic preferences among annotators, despite comparable overall quality. This exposes key limitations in current modeling pipelines, which tend to assume a single, consensus notion of quality and thus overlook or fail to reward the diverse, pluralistic preferences that arise in open-ended responses.

Altogether, our work introduces a comprehensive framework for evaluating realistic open-endedness, diversity, and pluralistic alignment in LMs, both within and across LMs. By integrating real-world queries, a taxonomy of query types, and dense human annotations, INFINITY-CHAT provides a useful resource for diagnosing the Artificial Hivemind effect and for guiding the development of safer, more expressive, and more resourceful LMs that better empower human creativity.

For example, Figure 6 shows that DeepSeek-V3 and gpt-4o-2024-11-20 generate overlapping phrases like “Elevate your iPhone with our,” “sleek, without compromising,” and “with bold, eye-catching” in answer to the query “Create a description with 2-3 sentences for an iPhone case collection that is a slim-fitted case with bold designs.” In some cases, models output identical responses: for “Generate a motto for a social media page focused on successes, wealth, and self-help,” both qwen-max-2025-01-25 and qwen-plus-2025-01-25 generate “Empower Your Journey: Unlock Success, Build Wealth, Transform Yourself.” These instance-level verbatim overlaps illustrate the severity of the “Artificial Hivemind” effect across models.
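Verbatim phrase overlaps like these can be surfaced automatically by extracting contiguous word sequences shared between two responses. A minimal sketch using Python's difflib (a lexical heuristic for illustration, not the paper's method; the example strings are paraphrases of the responses quoted above, not exact model outputs):

```python
from difflib import SequenceMatcher

def shared_phrases(resp_a, resp_b, min_words=4):
    """Return word sequences of at least `min_words` tokens that appear
    contiguously in both responses (exact token match, punctuation-sensitive)."""
    ta, tb = resp_a.split(), resp_b.split()
    matcher = SequenceMatcher(None, ta, tb, autojunk=False)
    return [
        " ".join(ta[m.a : m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]

# Hypothetical iPhone-case descriptions from two models, sharing an opener:
a = "Elevate your iPhone with our slim case featuring bold designs"
b = "Elevate your iPhone with our sleek, bold, eye-catching cases"
```

Here `shared_phrases(a, b)` recovers the shared opening phrase "Elevate your iPhone with our". Normalizing case and punctuation before matching would catch additional near-verbatim overlaps.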