Language Understanding and Pragmatics · LLM Reasoning and Architecture · Psychology and Social Cognition

Why do different LLMs generate nearly identical outputs?

Explores whether diversity in model architectures and training actually produces diverse ideas, or whether shared alignment procedures and training data cause convergence on similar responses.

Note · 2026-03-27 · sourced from Foundation Models
What kind of thing is an LLM really? · Why do AI systems fail at social and cultural interpretation? · What happens to social order when AI removes ritual constraints?

The INFINITY-CHAT study evaluated 70+ open- and closed-source LLMs on 26K real-world open-ended queries — queries that admit a wide range of plausible answers with no single ground truth. The findings reveal a pronounced "Artificial Hivemind" effect characterized by two distinct phenomena:

  1. Intra-model repetition — a single model consistently generates similar responses to the same prompt across runs.
  2. Inter-model homogeneity — different models independently produce strikingly similar outputs, sometimes verbatim: DeepSeek-V3 and GPT-4o generated overlapping phrases like "Elevate your iPhone with our," "sleek, without compromising." In some cases, models from the same family output identical responses.

The inter-model effect is the more concerning finding. Model ensembles — using multiple different models to increase diversity — may not yield true diversity when their constituents share overlapping alignment and training priors. The convergence is not just stylistic but substantive: models converge on the same ideas, not just the same words.
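One crude way to see inter-model homogeneity of the kind described above is n-gram overlap between outputs from different models for the same prompt. The sketch below uses 4-gram Jaccard similarity; the model names and response strings are hypothetical illustrations, not data from the paper.

```python
from itertools import combinations


def ngram_set(text: str, n: int = 4) -> set:
    """Set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 4) -> float:
    """Jaccard similarity of the two texts' n-gram sets (0 = disjoint, 1 = identical)."""
    sa, sb = ngram_set(a, n), ngram_set(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical outputs from three different models for one ad-copy prompt.
outputs = {
    "model_a": "Elevate your iPhone with our sleek new case without compromising on protection",
    "model_b": "Elevate your iPhone with our sleek durable case without compromising on style",
    "model_c": "A rugged leather sleeve that trades a little bulk for serious drop protection",
}

# High pairwise scores between unrelated models would signal the hivemind effect.
for (m1, t1), (m2, t2) in combinations(outputs.items(), 2):
    print(f"{m1} vs {m2}: {jaccard(t1, t2):.2f}")
```

Verbatim shared phrases like "Elevate your iPhone with our" show up directly as shared n-grams, which is why even this surface-level metric captures part of the effect; substantive (idea-level) convergence would need semantic similarity measures instead.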

This has direct implications for the False Punditry argument. Combined with "Does polished AI output trick audiences into trusting it?", the hivemind effect means that AI-generated social media content will sound similar regardless of which model generates it. The apparent diversity of AI voices on social media is illusory: different accounts using different models will produce strikingly similar analysis, framing, and conclusions, creating a false consensus that looks like independent agreement.

Read alongside "Why do LLMs generate novel ideas from narrow ranges?", the hivemind effect extends from research ideas to all open-ended generation. The diversity collapse documented in research ideation is a specific instance of a general phenomenon: LLMs trained on overlapping data with similar alignment procedures converge on a shared distribution of outputs.

Recommendation as a concrete domain instance. LLM-based conversational recommender systems exhibit the hivemind in a specific, measurable way: "the most popular items such as The Shawshank Redemption appear around 5% of the time" across different recommendation datasets, and "the recommended popular items are similar across different datasets, which may reflect the item popularity in the pre-training corpus of LLMs" (Large Language Models as Zero-Shot Conversational Recommenders). The convergence is not on quality or relevance but on pretraining-distribution popularity — the same items surface regardless of the user's context or the dataset's actual popularity distribution. This is the hivemind effect translated from open-ended generation to decision-making: LLMs don't just write the same things, they recommend the same things.
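The popularity concentration described above can be made concrete by counting how often each item appears across recommendation lists generated for unrelated users. A minimal sketch, with hypothetical recommendation lists (the item titles are illustrative, not drawn from the paper's datasets):

```python
from collections import Counter

# Hypothetical LLM recommendation lists for four unrelated users.
rec_lists = [
    ["The Shawshank Redemption", "Inception", "The Godfather"],
    ["The Shawshank Redemption", "Pulp Fiction", "The Dark Knight"],
    ["Forrest Gump", "The Shawshank Redemption", "Inception"],
    ["The Godfather", "Inception", "The Shawshank Redemption"],
]

# Share of all recommendation slots taken by each item.
counts = Counter(item for recs in rec_lists for item in recs)
total_slots = sum(counts.values())

for item, c in counts.most_common(3):
    print(f"{item}: {c / total_slots:.1%} of recommendation slots")
```

The paper's observation is essentially that this histogram is far more peaked than the target datasets' own popularity distributions, and peaked on the same items across datasets — pointing to the pretraining corpus, not the user context, as the source of the ranking.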

The study also found that reward models and LM-based judges are miscalibrated for responses that elicit divergent human preferences — they assume a single consensus notion of quality and fail to reward the pluralistic preferences that open-ended queries produce. This means the homogeneity is self-reinforcing: training on reward model scores optimizes for the consensus the hivemind already occupies.

NoveltyBench (2025) provides the first benchmark-level quantification of mode collapse across 20 leading models. Evaluating models on prompts curated to elicit diverse answers (using filtered real-world queries), the study finds that current SOTA systems "generate significantly less diversity than human writers." A counterintuitive finding: larger models within a family often exhibit *less* diversity than their smaller counterparts, directly challenging the assumption that capability on standard benchmarks translates to generative utility. While in-context regeneration prompting strategies can elicit some diversity, the findings reveal "a fundamental lack of distributional diversity" that reduces utility for users seeking varied responses. The mode collapse is driven by alignment: today's aligned models produce lower-entropy output distributions than earlier generations, and repeated random sampling produces substantial near-duplicates. Source: Arxiv/Evaluations.
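Counting "distinct" generations of the kind NoveltyBench measures requires collapsing near-duplicates. NoveltyBench uses a learned deduplication model; the sketch below substitutes a crude `difflib.SequenceMatcher` proxy with greedy clustering, and the sample generations are hypothetical. A low distinct count over many samples is the mode-collapse signature.

```python
from difflib import SequenceMatcher


def distinct_count(generations: list[str], threshold: float = 0.8) -> int:
    """Greedy near-duplicate collapse: a generation counts as new only if its
    SequenceMatcher ratio to every previously kept generation is below threshold.
    (A stand-in for NoveltyBench's learned deduplication model.)"""
    kept: list[str] = []
    for g in generations:
        if all(SequenceMatcher(None, g, k).ratio() < threshold for k in kept):
            kept.append(g)
    return len(kept)


# Hypothetical repeated samples from one model for one open-ended prompt.
samples = [
    "Start small: write one page every morning before checking your phone.",
    "Start small: write a single page each morning before you check your phone.",
    "Keep a pocket notebook and jot down overheard dialogue during the day.",
    "Start small, writing one page every morning before checking your phone.",
]

# Paraphrases of the same idea collapse together; genuinely different
# ideas survive as separate clusters.
print(distinct_count(samples))
```

With a distinct count in hand, diversity can be reported as distinct generations per k samples, which is what makes "less diversity than human writers" a measurable claim rather than an impression.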


Source: Foundation Models · Paper: Artificial Hivemind: The Open-Ended Homogeneity of Language Models

Related concepts in this collection

Concept map
19 direct connections · 175 in 2-hop network · dense cluster



different LLMs independently converge on similar outputs in open-ended generation — the artificial hivemind effect means model diversity does not produce idea diversity