INQUIRING LINE

How much does diversity training cost in single-shot pass@1 performance?

This explores the assumed tradeoff in the question — that training a model to produce varied outputs (diversity) must come at the expense of its best single-attempt accuracy (pass@1) — and asks how steep that tax is.


Read this way, the question is: if you train a model to stay diverse rather than collapse onto one favored answer, what does that cost you on a single shot? The corpus's most interesting move is to challenge the premise. Much of it suggests the tradeoff is smaller than the folk assumption holds, conditional on domain and reward design, or even reversed.

The baseline worry is real and well documented. Outcome-based RL, which rewards only the final correct answer, sharpens a policy globally: it concentrates probability mass on winning trajectories and bleeds diversity even on problems the model hasn't solved yet (see "Does outcome-based RL diversity loss spread across unsolved problems?"). The same entropy-collapse mechanism shows up in search agents, where RL squeezes exploration breadth that SFT on varied demonstrations had preserved (see "Does reinforcement learning squeeze exploration diversity in search agents?"). RL also quietly converges models onto a single dominant pretraining format within the first epoch, suppressing alternatives regardless of whether they performed better (see "Does RL training collapse format diversity in pretrained models?"). So the default direction of pressure is toward narrowing, which is exactly why people assume diversity must be bought back at a price.
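
To make "concentrating probability mass" concrete, here is a minimal sketch of what sharpening does to the entropy of a policy. It is a toy stand-in, not an RL implementation: the four-answer distribution, the `sharpen` helper, and the exponent are all hypothetical choices for illustration, and none of them come from the notes cited here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sharpen(probs, power):
    """Raise each probability to `power` and renormalize.

    power > 1 moves mass toward the already-favored answers, which is
    the qualitative effect attributed to outcome-based RL above.
    This is an illustrative proxy, not a training procedure.
    """
    weights = [p ** power for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical policy over four candidate answers (assumed numbers).
policy = [0.4, 0.3, 0.2, 0.1]
sharpened = sharpen(policy, 4.0)

print("entropy before:", entropy(policy))
print("entropy after: ", entropy(sharpened))
print("top answer mass before/after:", max(policy), max(sharpened))
```

The top answer gains mass and the entropy falls, so the alternatives become rare samples even though nothing in the sharpening step checked whether they were worse.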

But several notes argue the cost can be near-zero or negative. DARLING jointly optimizes for quality and semantic diversity and finds that diversity rewards *catalyze* exploration, producing higher-quality outputs than quality-only baselines on both creative and math tasks; diversity here pays for itself rather than taxing accuracy (see "Can diversity optimization improve quality during language model training?"). Critique models inserted into the training loop maintain solution diversity across self-training iterations, and that note treats the training-time benefit as more fundamental than test-time accuracy gains, because preventing premature convergence is what keeps the model improving at all (see "Do critique models improve diversity during training itself?"). And when models feed into a search procedure at inference, training for varied competent solutions beats converging on a single answer outright: an entropy-collapsed policy cannot reach problems that a diverse one solves (see "Should training maximize diversity when models feed into search?").
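
The gap between single-shot and search-time performance can be sketched with a small calculation. Assuming each sample is an independent draw with a per-problem success probability p, the chance that at least one of k samples is correct is 1 - (1 - p)^k. The two policies below are hypothetical and their numbers are invented for illustration; they are not results from any of the notes.

```python
def pass_at_k(success_probs, k):
    """Expected fraction of problems solved by at least one of k
    independent samples, given per-problem success probabilities."""
    return sum(1 - (1 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem success probabilities on four problems.
# The collapsed policy is certain on three problems and has zero
# mass on the fourth; the diverse policy is less sure everywhere
# but never assigns zero probability to a correct solution.
collapsed = [1.0, 1.0, 1.0, 0.0]
diverse = [0.7, 0.7, 0.7, 0.3]

for k in (1, 8):
    print(k, pass_at_k(collapsed, k), pass_at_k(diverse, k))
```

In this toy setup the collapsed policy wins at k = 1 but gains nothing from extra samples, because the problem it assigns zero probability to stays unreachable. The diverse policy pays a single-shot cost and overtakes it once several samples are drawn. How large that single-shot cost is in practice is exactly what the notes above dispute.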

The honest answer the corpus points to is that the cost is domain-dependent, not a fixed number. Preference tuning reduces lexical-syntactic diversity in code (where convergence toward correct solutions is rewarded) but *increases* it in creative writing (where distinctiveness is the reward) (see "Does preference tuning always reduce diversity the same way?"). So in convergence-shaped domains, diversity and single-shot accuracy genuinely pull against each other; in open-ended ones they align. There's even a structural argument that the diversity you're protecting may be smaller than you think: different models independently converge on near-identical outputs (an "Artificial Hivemind"), so some apparent diversity loss may just be surfacing a sameness that was already baked in (see "Do different AI models actually produce diverse outputs?").

The thing worth taking away: the framing of "diversity vs. pass@1" mostly holds only when your reward collapses the policy in the first place. The corpus keeps finding that diversity loss and quality are governed by different mechanisms. Historical exploration at training time and batch exploration at test time are structurally distinct (see "Does outcome-based RL diversity loss spread across unsolved problems?"), and multi-agent or role-specialized finetuning preserves diversity *and* keeps improving past the single productive iteration that single-agent finetuning is limited to (see "Can multiple agents stay diverse during training together?"). The cost isn't a tax you are bound to pay. It's a symptom of a reward design that didn't have to collapse the policy.


Sources (9 notes)

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Should training maximize diversity when models feed into search?

Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Can multiple agents stay diverse during training together?

Training generation and critic agents on distinct role-dependent data prevents the overfitting collapse that limits single-agent finetuning to one productive iteration. Removing critics or summarization degrades performance, confirming both components are critical.
