How do quality thresholds change which model produces more usable diversity?
This explores a measurement trap: whether you put a quality filter in front of 'diversity' decides which model looks more diverse — and the ranking can flip depending on where you set the bar.
The corpus's sharpest claim is that base models only *appear* more diverse because their variance spreads into incoherent space. Once you measure diversity among quality-passing outputs instead of all outputs, preference-tuned models generate *more* semantic diversity, not less (see "Does preference tuning actually reduce the diversity of model outputs?"). So the threshold isn't a detail of the experiment; it's the thing that decides who wins. Set the bar at zero and the noisy model looks creative. Raise it and the disciplined model pulls ahead.
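A minimal sketch of that flip, under stated assumptions: each output is a (quality score, embedding) pair, diversity is the mean pairwise Euclidean distance between embeddings, and the function names and toy numbers are illustrative rather than taken from the cited note.

```python
import math
from itertools import combinations


def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs; 0.0 if fewer than two points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def usable_diversity(outputs, quality_floor):
    """Diversity measured only among outputs whose quality clears the floor."""
    kept = [embedding for quality, embedding in outputs if quality >= quality_floor]
    return mean_pairwise_distance(kept)


# Toy data (hypothetical): (quality score, 2-D embedding) per output.
# The "base" model scatters widely, but its far-flung outputs are low quality.
base = [(0.9, (0.0, 0.0)), (0.8, (0.1, 0.0)), (0.2, (5.0, 5.0)), (0.1, (-5.0, 4.0))]
# The "tuned" model stays in a smaller region, but every output is usable.
tuned = [(0.9, (0.0, 0.0)), (0.85, (1.0, 0.0)), (0.8, (0.0, 1.0)), (0.9, (1.0, 1.0))]

for floor in (0.0, 0.5):
    winner = "base" if usable_diversity(base, floor) > usable_diversity(tuned, floor) else "tuned"
    print(f"quality floor {floor}: {winner} looks more diverse")
```

With a floor of 0.0 the scattered model wins; with a floor of 0.5 only its two near-identical outputs survive and the ranking reverses. Real measurements would replace the toy embeddings and scores with a sentence encoder and a quality judge, both of which are choices this sketch leaves open.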
This reframes a finding that otherwise seems to contradict it. Smaller models around 500M parameters produce more unique outputs per sample, because larger models concentrate probability mass on their preferred answers (see "Why aren't bigger models better for generating diverse outputs?"). But 'unique per sample' is raw uniqueness with no quality gate, which is exactly the metric that flatters incoherent variance. The two notes aren't in conflict; they're measuring at different thresholds. The reader's takeaway: 'more diverse' is meaningless without naming the quality floor you measured above.
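The difference between raw and gated uniqueness can be sketched in a few lines. This is an illustration only: the `is_usable` check (every word is alphabetic) is a hypothetical stand-in for a real quality gate, and the sample strings are invented.

```python
def unique_per_sample(samples):
    """Raw uniqueness: distinct outputs divided by samples drawn, no quality gate."""
    return len(set(samples)) / len(samples) if samples else 0.0


def usable_unique_per_sample(samples, is_usable):
    """Gated uniqueness: distinct outputs that pass the gate, per sample drawn."""
    usable = {s for s in samples if is_usable(s)}
    return len(usable) / len(samples) if samples else 0.0


def is_usable(text):
    """Hypothetical quality gate: non-empty and made only of alphabetic words."""
    words = text.split()
    return bool(words) and all(word.isalpha() for word in words)


# Invented samples: the small model never repeats itself, but half is noise.
small_model = ["red fox", "blue owl", "rd f0x", "##", "gray wolf", "b1ue"]
# The large model repeats its favorite answer, but everything is usable.
large_model = ["red fox", "red fox", "blue owl", "gray wolf", "tan hare", "red fox"]

for name, samples in (("small", small_model), ("large", large_model)):
    raw = unique_per_sample(samples)
    gated = usable_unique_per_sample(samples, is_usable)
    print(f"{name}: raw={raw:.2f} gated={gated:.2f}")
```

On the raw metric the small model leads; once the gate is applied the large model does. Dividing by the number of samples drawn in both functions keeps the two figures comparable under a fixed sampling budget.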
The deeper reason this matters is that quality, diversity, and complexity are not one axis; they drive different things downstream. Quality drives in-distribution generalization, diversity drives out-of-distribution generalization, and complexity strengthens both. Most evaluation collapses all three into a single quality score, which is precisely how self-improvement loops quietly bleed out their diversity (see "How do quality, diversity, and complexity affect synthetic data differently?"). A single threshold that conflates these will systematically pick the wrong model for whichever job you actually care about.
The corpus also says the threshold question is domain-dependent, not universal. The same preference tuning that compresses diversity in code (where the reward is converging on the correct solution) expands it in creative writing (where the reward is standing out) (see "Does preference tuning always reduce diversity the same way?"). So 'usable diversity above threshold' has a different shape per domain: the bar that filters helpfully for code filters harmfully for prose. And rather than treat quality and diversity as a trade-off you tune a threshold to balance, one line of work optimizes both jointly: a learned classifier rewards semantic diversity *during* RL and finds the diversity pressure actually catalyzes higher-quality outputs than quality-only training (see "Can diversity optimization improve quality during language model training?").
The thing you didn't know you wanted to know: the reason ensembling many models for diversity disappoints is the same threshold logic at population scale. Across 70+ models and 26K open-ended queries, models independently converge on near-identical answers, an 'Artificial Hivemind' produced by shared training data and alignment (see "Do different AI models actually produce diverse outputs?"). Above a usability threshold, the diversity between models collapses too, not just within one. So 'which model produces more usable diversity' may, past a high enough bar, have the deflating answer: barely any of them, and barely differently.
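One way to check whether an ensemble actually adds variety is to measure distance between models rather than within one. The sketch below assumes Jaccard distance over lowercase word sets as a crude stand-in for semantic distance; the model names and responses are invented, and nothing here reproduces the cited study's method.

```python
from itertools import combinations, product


def jaccard_distance(a, b):
    """1 - |shared words| / |all words|; 0.0 means identical word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def between_model_diversity(outputs_by_model):
    """Mean distance over every pair of responses drawn from two different models."""
    distances = []
    for (_, xs), (_, ys) in combinations(outputs_by_model.items(), 2):
        distances.extend(jaccard_distance(x, y) for x, y in product(xs, ys))
    return sum(distances) / len(distances) if distances else 0.0


# Invented responses to one open-ended prompt.
hivemind = {
    "model_a": ["time is a river"],
    "model_b": ["time is a river"],
    "model_c": ["time is a river"],
}
varied = {
    "model_a": ["time is a river"],
    "model_b": ["time is a thief"],
}

print(between_model_diversity(hivemind))
print(between_model_diversity(varied))
```

A score near zero across an ensemble means the extra models are contributing near-copies of the same answer. Applying the same quality gate before this measurement, as in the single-model case, would show whether any between-model variety survives the bar.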
Sources (6 notes)
When diversity is measured among quality-passing outputs rather than all outputs, preference-tuned models generate greater semantic diversity than base models. Base models appear more diverse only because their variance spans incoherent space.
Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.
Quality drives in-distribution generalization, diversity enables out-of-distribution generalization, and complexity strengthens both. Current evaluation methods collapse these into a single quality metric, causing self-improvement loops to degrade through irreversible diversity loss.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.