How does tokenization toward corpus mean affect downstream output diversity?

This explores how models pulled toward high-frequency, corpus-average forms at generation time end up flattening the variety of what they produce. The note corpus reframes it as a convergence problem that starts at the input, not just the output.

This reads the question as: when a model's next-token machinery keeps steering toward whatever the training corpus saw most often, what happens to the range of things it can say? The note corpus suggests the pull toward the training-corpus mean is not a quirk of one model; it is a shared gravity well that different systems fall into independently. INFINITY-CHAT's study of 70+ models found an "Artificial Hivemind": across roughly 26K open-ended queries, the models converge on strikingly similar or identical answers, because overlapping training data and alignment procedures point them all at the same high-probability center (see "Do different AI models actually produce diverse outputs?"). So the diversity you'd hope to get from ensembling many models partly evaporates, since they are all leaning on the same statistical mass.
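One rough way to make "converging on similar answers" concrete is to score how much a set of responses overlaps. The sketch below is an illustrative assumption, not the metric INFINITY-CHAT used: it takes the mean pairwise Jaccard overlap of word sets as a crude lexical proxy for homogeneity. The function names and the example responses are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two responses (1.0 = same vocabulary)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty responses are trivially identical
    return len(words_a & words_b) / len(words_a | words_b)


def mean_pairwise_similarity(responses: list[str]) -> float:
    """Average Jaccard overlap over every pair of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up responses standing in for different models answering one prompt.
homogeneous = [
    "the ocean is a vast blue mirror",
    "the ocean is a vast blue mirror of the sky",
    "the ocean is a vast mirror",
]
varied = [
    "salt remembers every shipwreck",
    "the ocean is a vast blue mirror",
    "tides keep time for nobody",
]

print("homogeneous:", round(mean_pairwise_similarity(homogeneous), 3))
print("varied:     ", round(mean_pairwise_similarity(varied), 3))
```

A higher score means the pool of responses adds less new material per model. Lexical overlap misses paraphrased sameness, so an embedding-based similarity would likely be a better proxy in practice.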

The more surprising part is that this homogenization begins before generation, on the input side. Adam's Law describes a "high-frequency channel": the same distributional bias that makes a model accurate on common phrasings also filters out distinctiveness, because users iteratively rephrase their prompts toward the higher-frequency forms the model handles best (see "Does high-frequency text homogenize user input before generation?"). Related work shows that two prompts meaning exactly the same thing produce systematically different output quality depending on how frequent their phrasing is in pre-training: the model registers statistical mass, not meaning, so "paraphrase equivalence" is a fiction (see "Why do semantically identical prompts produce different LLM outputs?"). Diversity gets squeezed at both ends: distinct inputs get flattened toward common forms, then common forms get continued toward common outputs.
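To illustrate what "how frequent a phrasing is" could mean, here is a toy sketch that scores two paraphrases by their mean smoothed unigram log-frequency against a tiny stand-in corpus. This is an assumption made for illustration only: the cited notes do not specify how sentence-level frequency is measured, and the corpus, sentences, and function names below are hypothetical.

```python
import math
from collections import Counter


def unigram_model(corpus_sentences):
    """Return a log-probability function with add-one smoothing."""
    counts = Counter(w for s in corpus_sentences for w in s.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))

    return logprob


def mean_log_frequency(sentence, logprob):
    """Average per-word log-probability; higher means more common phrasing."""
    words = sentence.lower().split()
    if not words:
        raise ValueError("sentence must contain at least one word")
    return sum(logprob(w) for w in words) / len(words)


# Tiny made-up stand-in for pre-training text.
toy_corpus = [
    "how do i fix this error",
    "how do i install this package",
    "how do i fix this bug",
]
logprob = unigram_model(toy_corpus)

common = "how do i fix this problem"
rare = "what remedy resolves this malfunction"

print("common phrasing:", round(mean_log_frequency(common, logprob), 3))
print("rare phrasing:  ", round(mean_log_frequency(rare, logprob), 3))
```

The two sentences ask roughly the same thing, yet the first sits much closer to the corpus's statistical mass. A real measurement would use the model's own token probabilities rather than unigram counts.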

Why does the output stay smooth rather than branching? Because token prediction is trained to continue toward the training distribution, not to explore competing positions. One note frames generation as a "smooth probabilistic flow" rather than a turbulent exploration: the process never veers into logically related counter-views, so claims multiply without generating genuinely new perspectives (see "Does LLM generation explore competing claims while producing text?"). When the prompt is underspecified, the same dynamic produces generic answers: the model defaults to blended training-data priors, a "context collapse" that comes from missing scaffolding rather than any failure to understand (see "Why do large language models produce generic responses to vague queries?").

Here's the thread you might not expect: the diversity is latent, not absent. Shanahan's 20-questions test shows a model holds a superposition of many consistent answers and *samples* one at generation time. Regenerate and you get a different, equally consistent response, which shows there is no fixed commitment underneath (see "Do large language models actually commit to a single character?"). So the variety exists in the distribution; what collapses it is the steady pull toward the high-frequency center plus low-temperature, alignment-shaped decoding. The practical lever, the note corpus implies, is less about the tokenizer itself and more about resisting that pull: richer contextual scaffolding on the input side, and sampling that doesn't always snap back to the mean. The model isn't incapable of distinctiveness; it's biased toward the average.
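The effect of low-temperature decoding on a latent distribution can be seen in a few lines. This is a minimal sketch using standard temperature-scaled softmax; the logits are made-up numbers for five hypothetical candidate tokens, not values from any real model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; T < 1 sharpens, T > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; lower means less variety when sampling."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical next-token logits for five candidate continuations.
logits = [3.0, 2.5, 2.0, 1.0, 0.5]

for t in (0.2, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top prob={max(probs):.3f}, entropy={entropy_bits(probs):.3f} bits")
```

The underlying logits never change, so the alternatives remain in the distribution. Lowering the temperature only concentrates sampling on the top candidate, which matches the claim that diversity is latent rather than absent.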

Sources (6 notes)

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Does high-frequency text homogenize user input before generation?

Adam's Law shows LLMs flatten distinct prompts at comprehension time as users rephrase toward higher-frequency forms the model handles best. The same distributional property that creates accuracy on common tasks filters out distinctiveness on the input side.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Why do large language models produce generic responses to vague queries?

Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.
