What alignment procedures cause different models to share the same output distribution?
This explores why models from different labs, trained separately, end up producing nearly the same answers, and which steps in alignment (post-training) are the culprits. The corpus points to a convergence story with several reinforcing causes, not one. The headline evidence is the "Artificial Hivemind" finding: across 70+ models and 26K open-ended queries, different LLMs independently generate strikingly similar or even identical responses, and the paper attributes this to overlapping training data plus shared alignment procedures (see: Do different AI models actually produce diverse outputs?). So the question's premise holds; the interesting part is the mechanism.
One mechanism is what reinforcement-learning post-training actually does to the output distribution. Rather than teaching genuinely new behavior, RL tends to amplify a single dominant format already latent in pretraining while suppressing the alternatives, and it does this within the first epoch (see: Does RL training collapse format diversity in pretrained models?). The winning format tracks model scale rather than which format performs best, so labs that run a similar RL stage over similarly sized models pretrained on heavily overlapping web text are all collapsing toward the same favored format. This pairs with the LIMA result that alignment fine-tuning mostly *activates* capabilities the base model already has rather than installing new ones, which makes post-training more of a stylistic selector than a source of divergence (see: Can careful curation replace massive alignment datasets?).
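The amplification dynamic can be illustrated with a toy model. The sketch below is not the setup of the cited experiments: it assumes a policy that is just a softmax over three hypothetical output formats, a made-up reward that gives the already-dominant format an edge, and exact (noise-free) policy-gradient updates. All names and numbers are illustrative.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means less format diversity."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def policy_gradient_step(logits, rewards, lr):
    """One exact gradient-ascent step on expected reward.

    For a softmax policy, d E[r] / d logit_k = p_k * (r_k - E[r]).
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [
        logit + lr * p * (r - expected_reward)
        for logit, p, r in zip(logits, probs, rewards)
    ]

def simulate(initial_probs, rewards, lr=1.0, steps=100):
    """Return the format distribution after every update step."""
    logits = [math.log(p) for p in initial_probs]
    history = [softmax(logits)]
    for _ in range(steps):
        logits = policy_gradient_step(logits, rewards, lr)
        history.append(softmax(logits))
    return history

# Hypothetical "pretrained" mix over three formats, and an assumed
# reward that favors the format that is already most common.
initial = [0.5, 0.3, 0.2]
rewards = [1.0, 0.5, 0.5]

history = simulate(initial, rewards)
for step in (0, 10, 100):
    probs = history[step]
    rounded = [round(p, 3) for p in probs]
    print(f"step {step:3d}  probs={rounded}  entropy={entropy(probs):.3f}")
```

Because each update is scaled by the format's current probability, the format that starts ahead pulls away fastest, and entropy over formats falls as training proceeds. That is the "selector, not teacher" picture in miniature, under the stated assumptions.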
There's also a self-reinforcing loop worth knowing about: aligned models can synthesize their own instruction data from nothing but pre-query formatting tokens, and in the MAGPIE work that synthetic data matched human-curated sets in quality (see: Can aligned LLMs generate their own training data?). When labs train on data generated by already-aligned models, the same distributional fingerprints propagate across the ecosystem, adding a homogenizing feedback loop on top of the shared web corpus.
The deeper layer is that this convergence is partly baked in below the alignment stage. Models respond to *corpus frequency*, not meaning: higher-frequency phrasings win regardless of semantics (see: Why do semantically identical prompts produce different LLM outputs?). Models trained on the same statistical mass will therefore gravitate to the same high-probability outputs before alignment even begins. What's striking, then, is the contrast in *where* tuning leaves its mark: proxy-tuning at decoding time shifts mainly reasoning and style while leaving stored knowledge intact, whereas direct weight fine-tuning corrupts knowledge in lower layers (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). That hints at a lever: if convergence is driven by a thin distributional shift applied at alignment time, intervening at decoding rather than in the weights could preserve more diversity. And methods like consistency training, which deliberately train models to answer *identically* across prompt variants, show the field sometimes engineers homogeneity on purpose (see: Can models learn to ignore irrelevant prompt changes?). The thing you may not have expected: the diversity you'd hope to get from ensembling several models is largely an illusion, because the alignment recipe is doing the same selection job in all of them.
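To make the decoding-time lever concrete, here is a minimal sketch of the logit arithmetic usually associated with proxy-tuning: the large base model's next-token logits are offset by the difference between a small tuned "expert" and its untuned "anti-expert" counterpart. The notes above do not spell out this formula, so treat the formulation as an assumption; the four-token vocabulary and all logit values are invented for illustration, and no real model is called.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_distribution(base_logits, expert_logits, antiexpert_logits):
    """Shift the base model's logits by (expert - anti-expert).

    The base model's weights are never modified; only the next-token
    distribution used for sampling changes.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)

# Hypothetical next-token logits over a 4-token vocabulary.
base_logits = [2.0, 1.0, 0.5, 0.0]        # large untuned model
expert_logits = [0.5, 1.5, 0.0, 0.0]      # small tuned model
antiexpert_logits = [0.5, 0.5, 0.0, 0.0]  # same small model, untuned

base_probs = softmax(base_logits)
tuned_probs = proxy_tuned_distribution(
    base_logits, expert_logits, antiexpert_logits
)

print("base :", [round(p, 3) for p in base_probs])
print("tuned:", [round(p, 3) for p in tuned_probs])
```

In this toy case the tuned small model prefers token 1 more than its untuned twin does, so only that preference is transferred onto the base distribution. Where the expert and anti-expert agree, the offset is zero and the base model's ranking is left alone, which is one way to read the claim that the shift is "thin".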
Sources (7 notes)
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
LIMA demonstrates that 1000 carefully curated examples fine-tuned on a strong pretrained model achieve competitive alignment performance with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.
MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.
Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.