How does probability mass concentration affect sampling diversity across model scales?
This explores why models that pile their probability onto a few favored outputs generate less varied samples, and whether that tendency tracks with model size.
The most direct answer in the corpus is counterintuitive: bigger is not better for diversity. For synthetic data generation, models around 500M parameters produce *more* unique outputs per sample than larger ones, because larger models concentrate probability mass on their preferred completions. Within a fixed sampling budget, that sharpness costs you variety (see "Why aren't bigger models better for generating diverse outputs?"). So concentration and scale interact in a way that punishes the assumption that a more capable model is also a more inventive sampler.
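To make the fixed-budget argument concrete, here is a minimal sketch of how the expected number of distinct outputs falls as a distribution sharpens. It is an illustration under stated assumptions, not the method used in the cited note: the 1,000-outcome toy distribution, the use of softmax temperature as a stand-in for "sharpness", and the function names are all hypothetical. It relies only on the standard identity that, for n independent draws, the expected number of distinct outcomes is the sum over outcomes of 1 - (1 - p_i)^n.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature gives a sharper peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_unique(probs, n_samples):
    """Expected number of distinct outcomes seen in n_samples independent draws."""
    return sum(1.0 - (1.0 - p) ** n_samples for p in probs)


# Hypothetical toy "output space": 1,000 candidate completions with decaying logits.
logits = [-0.05 * i for i in range(1000)]

flat = softmax(logits, temperature=1.0)    # mass spread more widely
sharp = softmax(logits, temperature=0.25)  # mass concentrated on a few favorites

budget = 100  # fixed sampling budget
flat_unique = expected_unique(flat, budget)
sharp_unique = expected_unique(sharp, budget)

print(f"expected unique outputs, flatter distribution: {flat_unique:.1f}")
print(f"expected unique outputs, sharper distribution: {sharp_unique:.1f}")
```

Temperature here only models how peaked a distribution is; it does not claim that larger models behave like lower-temperature versions of smaller ones.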
But scale isn't the only axis that controls where the mass lands. Training does too, and often more decisively. Reinforcement learning that rewards only final-answer correctness sharpens the policy globally, concentrating mass on winning trajectories and draining diversity even on problems the model hasn't solved yet (see "Does outcome-based RL diversity loss spread across unsolved problems?"). The same entropy-collapse mechanism shows up in search agents, where RL squeezes exploration while supervised fine-tuning on diverse demonstrations preserves it (see "Does reinforcement learning squeeze exploration diversity in search agents?"). Interestingly, scale resurfaces here as a hidden variable: when RL collapses a model's many pretraining formats down to one dominant format, which format wins depends on model scale and not necessarily on performance (see "Does RL training collapse format diversity in pretrained models?"). Concentration is happening, but where the peak forms is scale-dependent and largely invisible when you start from a proprietary base.
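One common way to put a number on this kind of collapse is Shannon entropy, or its exponential, which reads as an "effective number" of options the policy still spreads mass over. The sketch below is illustrative only: the two four-option distributions are made-up stand-ins for a policy before and after sharpening, and the function names are hypothetical rather than taken from the cited notes.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def effective_support(probs):
    """exp(entropy): roughly how many options the distribution still uses."""
    return math.exp(shannon_entropy(probs))


# Hypothetical policy over four strategies, before and after sharpening.
before = [0.25, 0.25, 0.25, 0.25]
after = [0.97, 0.01, 0.01, 0.01]

print(f"effective options before: {effective_support(before):.2f}")
print(f"effective options after:  {effective_support(after):.2f}")
```

A uniform distribution over four options has an effective support of four; the sharpened one sits close to one, which is the sense in which exploration has been squeezed.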
The corpus also pushes back on the simple "concentration = bad" story. If you measure diversity only among outputs that pass a quality bar, preference-tuned models turn out *more* semantically diverse than base models. Base models only looked diverse because their spread covered incoherent, low-quality space (see "Does preference tuning actually reduce the diversity of model outputs?"). Whether tuning helps or hurts also depends on domain: it reduces lexical-syntactic variety in code (where convergence on correctness is the point) but increases it in creative writing (see "Does preference tuning always reduce diversity the same way?"). So "concentration" can mean pruning garbage or it can mean homogenization: the same mechanism, opposite value.
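The difference between raw and quality-filtered diversity can be sketched in a few lines. This is a toy illustration with loud assumptions: the sample strings and quality scores are invented, the 0.5 threshold is arbitrary, and counting distinct normalized strings is a crude stand-in for the semantic diversity measure the cited note refers to.

```python
def distinct_count(outputs):
    """Number of distinct outputs after trivial normalization (case, whitespace)."""
    return len({text.strip().lower() for text in outputs})


def quality_filtered_distinct(scored_outputs, threshold):
    """Distinct outputs among those whose quality score meets the threshold."""
    passing = [text for text, score in scored_outputs if score >= threshold]
    return distinct_count(passing)


# Hypothetical (text, quality_score) samples; scores would come from some judge.
base_samples = [
    ("asdf qwer", 0.1), ("the the the", 0.2), ("A fox runs.", 0.8),
    ("zzz", 0.0), ("a fox runs.", 0.9), ("blue ???", 0.1),
]
tuned_samples = [
    ("A fox runs.", 0.9), ("The fox sprints.", 0.9), ("A fox runs.", 0.8),
    ("A red fox dashes off.", 0.85), ("The fox sprints.", 0.9),
    ("Foxes run quickly.", 0.8),
]

threshold = 0.5  # arbitrary quality bar for this illustration
for name, samples in [("base", base_samples), ("tuned", tuned_samples)]:
    raw = distinct_count([text for text, _ in samples])
    kept = quality_filtered_distinct(samples, threshold)
    print(f"{name}: raw distinct = {raw}, quality-passing distinct = {kept}")
```

In this made-up data the base model wins on raw distinct count but loses once incoherent outputs are filtered out, which mirrors the direction of the finding without reproducing its measurement.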
What makes this more than a per-model curiosity is that concentration converges *across* models. Analysis of 70+ models on 26K open-ended queries found an "Artificial Hivemind": different models independently land on near-identical responses, because overlapping training data and shared alignment procedures sculpt their probability mass into the same shape. That quietly undermines the diversity you'd hope to get from ensembling across scales and vendors (see "Do different AI models actually produce diverse outputs?"). And the stakes compound over time: in self-improvement loops, diversity is what enables out-of-distribution generalization, and once it's lost the degradation is irreversible (see "How do quality, diversity, and complexity affect synthetic data differently?"). The thing you didn't know you wanted to know: the surest route to genuine sample diversity may be a *smaller*, lightly tuned model, not a larger, heavily aligned one.
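Cross-model convergence of the kind described above can be probed with a simple pairwise similarity score over responses to the same prompt. The sketch below is a rough illustration, not the analysis from the cited study: the model names and responses are invented, and Jaccard overlap of word sets is an assumed, deliberately simple similarity measure.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-set overlap between two responses, from 0.0 (disjoint) to 1.0 (same words)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def mean_pairwise_similarity(responses):
    """Average Jaccard similarity over every pair of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical responses from three models to one open-ended prompt.
homogeneous = {
    "model_a": "time is a river",
    "model_b": "time is a river",
    "model_c": "time is a flowing river",
}
varied = {
    "model_a": "time is a river",
    "model_b": "clocks melt at noon",
    "model_c": "hours fold like paper",
}

homogeneous_score = mean_pairwise_similarity(list(homogeneous.values()))
varied_score = mean_pairwise_similarity(list(varied.values()))
print(f"homogeneous ensemble similarity: {homogeneous_score:.2f}")
print(f"varied ensemble similarity:      {varied_score:.2f}")
```

A high average score across models on open-ended prompts would suggest that an ensemble adds less variety than its size implies.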
Sources (8 notes)
Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.
RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning: policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
When diversity is measured among quality-passing outputs rather than all outputs, preference-tuned models generate greater semantic diversity than base models. Base models appear more diverse only because their variance spans incoherent space.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
Quality drives in-distribution generalization, diversity enables out-of-distribution generalization, and complexity strengthens both. Current evaluation methods collapse these into a single quality metric, causing self-improvement loops to degrade through irreversible diversity loss.