Why do different LLMs generate nearly identical outputs?
Explores whether diversity in model architectures and training actually produces diverse ideas, or whether shared alignment procedures and training data cause convergence on similar responses.
INFINITY-CHAT evaluated 70+ open- and closed-source LLMs on 26K real-world open-ended queries, each admitting a wide range of plausible answers with no single ground truth. The findings reveal a pronounced "Artificial Hivemind" effect characterized by two distinct phenomena:
- Intra-model repetition: a single model consistently generates similar responses to the same prompt across repeated runs.
- Inter-model homogeneity: different models independently produce strikingly similar outputs, sometimes verbatim. DeepSeek-V3 and GPT-4o generated overlapping phrases such as "Elevate your iPhone with our" and "sleek, without compromising," and in some cases models from the same family output identical responses. (Both effects can be quantified with the similarity sketch below.)
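A minimal sketch of how both effects can be measured, assuming generations have already been collected per model. Trigram Jaccard overlap stands in for the paper's own (embedding-based) similarity measures, and the data layout and function names here are invented:

```python
from itertools import combinations

def ngrams(text: str, n: int = 3) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a: str, b: str) -> float:
    sa, sb = ngrams(a), ngrams(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mean_pairwise(texts: list[str]) -> float:
    pairs = list(combinations(texts, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# generations: {model_name: [sample_1, ..., sample_k]} for one prompt, k >= 2
def hivemind_scores(generations: dict[str, list[str]]) -> tuple[float, float]:
    # Intra-model repetition: average similarity among one model's own samples
    intra = sum(mean_pairwise(s) for s in generations.values()) / len(generations)
    # Inter-model homogeneity: average similarity between samples from different models
    flat = [(m, t) for m, ts in generations.items() for t in ts]
    cross = [jaccard(a, b) for (ma, a), (mb, b) in combinations(flat, 2) if ma != mb]
    inter = sum(cross) / len(cross)
    return intra, inter
```

High intra with low inter would indicate models that are each repetitive but mutually distinct; the hivemind finding is that both scores are high together.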
The inter-model effect is the more concerning finding. Model ensembles — using multiple different models to increase diversity — may not yield true diversity when their constituents share overlapping alignment and training priors. The convergence is not just stylistic but substantive: models converge on the same ideas, not just the same words.
This has direct implications for the False Punditry argument. If polished AI output already tricks audiences into trusting it (see "Does polished AI output trick audiences into trusting it?"), the hivemind effect means that AI-generated social media content will sound similar regardless of which model generates it. The "diversity" of AI voices on social media is illusory: different accounts using different models will produce strikingly similar analysis, framing, and conclusions, creating a false consensus that looks like independent agreement.
As "Why do LLMs generate novel ideas from narrow ranges?" documents for research ideation, the hivemind effect extends beyond research ideas to all open-ended generation. The diversity collapse seen in ideation is a specific instance of a general phenomenon: LLMs trained on overlapping data with similar alignment procedures converge on a shared distribution of outputs.
Recommendation as a concrete domain instance. LLM-based conversational recommender systems exhibit the hivemind in a specific, measurable way: "the most popular items such as The Shawshank Redemption appear around 5% of the time" across different recommendation datasets, and "the recommended popular items are similar across different datasets, which may reflect the item popularity in the pre-training corpus of LLMs" (Large Language Models as Zero-Shot Conversational Recommenders). The convergence is not on quality or relevance but on pretraining-distribution popularity — the same items surface regardless of the user's context or the dataset's actual popularity distribution. This is the hivemind effect translated from open-ended generation to decision-making: LLMs don't just write the same things, they recommend the same things.
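A hypothetical sketch of that measurement: count how large a share of all recommendation slots the most popular items occupy, then compare across datasets. The dataset names and item lists below are invented stand-ins:

```python
from collections import Counter

def top_item_share(recs: list[str], k: int = 3) -> list[tuple[str, float]]:
    counts = Counter(recs)
    total = len(recs)
    # Fraction of all recommendation slots taken by each of the top-k items
    return [(item, n / total) for item, n in counts.most_common(k)]

recs_by_dataset = {
    "dataset_a": ["The Shawshank Redemption", "Inception",
                  "The Shawshank Redemption", "Titanic"],
    "dataset_b": ["The Shawshank Redemption", "The Godfather",
                  "Inception", "The Shawshank Redemption"],
}

# If the same items dominate every dataset at similar rates, the model is
# echoing pretraining popularity rather than each dataset's own distribution.
for name, items in recs_by_dataset.items():
    print(name, top_item_share(items))
```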
The study also found that reward models and LM-based judges are miscalibrated for responses that elicit divergent human preferences — they assume a single consensus notion of quality and fail to reward the pluralistic preferences that open-ended queries produce. This means the homogeneity is self-reinforcing: training on reward model scores optimizes for the consensus the hivemind already occupies.
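One way to make this miscalibration concrete, as a sketch with assumed data shapes (several human ratings per response alongside a single reward-model scalar): if the correlation below is negative, the judge systematically penalizes exactly the responses that elicit divergent human preferences.

```python
import statistics

def pearson(x: list[float], y: list[float]) -> float:
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# human_ratings: per response, ratings from several annotators (>= 2 each)
# rm_scores: the reward model's single scalar score per response
def divergence_penalty(human_ratings: list[list[float]], rm_scores: list[float]) -> float:
    disagreement = [statistics.stdev(r) for r in human_ratings]
    # Negative correlation: the judge scores divergence-eliciting responses
    # lower, optimizing toward the consensus the hivemind already occupies.
    return pearson(disagreement, rm_scores)
```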
NoveltyBench (2025) provides the first benchmark-level quantification of mode collapse across 20 leading models. Evaluating models on prompts curated to elicit diverse answers (using filtered real-world queries), the study finds that current SOTA systems "generate significantly less diversity than human writers." A counterintuitive finding: larger models within a family often exhibit *less* diversity than their smaller counterparts, directly challenging the assumption that capability on standard benchmarks translates to generative utility. While in-context regeneration prompting can elicit some diversity, the findings reveal "a fundamental lack of distributional diversity" that reduces utility for users seeking varied responses. The mode collapse is driven by alignment: today's aligned models produce lower-entropy distributions than earlier generations, and random sampling yields substantial near-duplicates. Source: arXiv/Evaluations.
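A rough stand-in for this kind of distributional-diversity measurement, reusing the `jaccard` helper from the first sketch. NoveltyBench itself scores distinctness with a learned equivalence model; the greedy clustering and the 0.5 threshold here are arbitrary assumptions:

```python
import math

# samples: k generations from one model for a single prompt
def distinct_classes(samples: list[str], threshold: float = 0.5) -> list[list[str]]:
    classes: list[list[str]] = []
    for s in samples:
        for cls in classes:
            if jaccard(s, cls[0]) >= threshold:  # jaccard from the first sketch
                cls.append(s)
                break
        else:
            classes.append([s])
    return classes

def distribution_entropy(samples: list[str]) -> float:
    probs = [len(c) / len(samples) for c in distinct_classes(samples)]
    # Low entropy = mode collapse: most samples land in one equivalence class
    return -sum(p * math.log2(p) for p in probs)
```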
Source: "Artificial Hivemind: The Open-Ended Homogeneity of Language Models" (Foundation Models).
Related concepts in this collection
- Does polished AI output trick audiences into trusting it? When AI generates professional-looking graphs, diagrams, and presentations, do audiences mistake visual polish for analytical depth? This matters because appearance might substitute for actual expertise. (Relation: the hivemind makes all AI artifacts sound similar.)
- Why do LLMs generate novel ideas from narrow ranges? LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation. (Relation: research ideation collapse is a specific instance of the general hivemind.)
- Why do preference models favor surface features over substance? Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness, all features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints. (Relation: reward model miscalibration reinforces homogeneity.)
- Why do multi-agent LLM systems converge without real debate? When multiple AI agents reason together, do they genuinely deliberate or just accommodate each other's views? Research into clinical reasoning systems reveals how often agents reach agreement without substantive disagreement. (Relation: the hivemind at the generation level parallels silent agreement at the reasoning level.)
Original note title: different LLMs independently converge on similar outputs in open-ended generation — the artificial hivemind effect means model diversity does not produce idea diversity