Why do more capable language models benefit more from diversity elicitation?

This explores why prompting a model to produce varied or exploratory outputs pays off more for stronger models than for weaker ones. The corpus suggests the reason is that capability sets the size of the latent pool you're drawing from.

The short version the corpus points to: diversity elicitation doesn't *create* range, it *unlocks* range that's already latent in the weights, so a model only benefits to the extent it had untapped variety to begin with. A useful starting frame is the finding that prompting can only reorganize what a model already knows, never inject what it lacks (see "Can prompt optimization teach models knowledge they lack?"). Eliciting diversity is the same move pointed at the output distribution rather than at knowledge: it's a retrieval lever, and the lever has nothing to grab in a model whose distribution is thin.

Why would a more capable model have a richer distribution to draw from? One striking piece of the picture is that LLMs don't commit to a single answer or persona. They hold a superposition of consistent possibilities and *sample* from it at generation time, so regenerating the same prompt yields different outputs, each internally coherent (see "Do large language models actually commit to a single character?"). Diversity elicitation widens the sampling from that superposition. On this reading, a bigger, better-trained model has a denser, more populated superposition; a smaller one has fewer distinct modes to sample, so turning up the diversity dial just reshuffles a small set.
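The reshuffling point can be pictured with a toy sampler. This is only an illustrative sketch, not a measurement of any real model: the two logit lists (`thin_model`, `rich_model`) are hypothetical stand-ins for a small and a large set of latent modes, and sampling temperature is assumed here as a simple proxy for the "diversity dial".

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Draw one mode index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def distinct_modes(logits, temperature, draws=200, seed=0):
    """Count how many different modes show up across repeated draws."""
    rng = random.Random(seed)
    seen = {sample_with_temperature(logits, temperature, rng) for _ in range(draws)}
    return len(seen)


# Hypothetical toy distributions (assumptions, not taken from any source):
thin_model = [2.0, 1.0, 0.5]                      # only 3 latent modes
rich_model = [2.0 - 0.1 * i for i in range(40)]   # 40 latent modes

for temperature in (0.2, 1.0, 2.0):
    print(
        f"T={temperature}: thin={distinct_modes(thin_model, temperature)}, "
        f"rich={distinct_modes(rich_model, temperature)}"
    )
```

In this toy setup the thin distribution can never yield more than three distinct outputs however high the temperature goes, while the rich one has many more modes for a higher temperature to reach. That is the sense in which elicitation has more to work with when the underlying distribution is well populated.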

The capacity-threshold evidence makes this concrete. On argument-scheme classification, smaller models plateau around F1 0.53 no matter how you prompt them, while only larger models exceed F1 0.55 once given few-shot examples and scheme descriptions. This points to a representational-capacity threshold below which extra scaffolding does nothing (see "Can large language models classify argument schemes reliably?"). That's the same shape as diversity elicitation: the prompt-side intervention only converts into gains once the model has enough internal structure for it to act on. The failure modes that scaling doesn't fix mark the regions where no amount of elicitation helps, because the latent material isn't there. These include systematic linguistic blind spots that worsen with structural complexity (see "Why do large language models fail at complex linguistic tasks?") and low-probability tasks that stay hard because of the autoregressive objective itself (see "Can we predict where language models will fail?").

There's a sharp counterweight worth knowing about, though. Across 70+ models on 26K open-ended queries, researchers found an "Artificial Hivemind": different models independently converge on near-identical outputs because of overlapping training data and shared alignment procedures (see "Do different AI models actually produce diverse outputs?"). So capability and diversity can pull *against* each other: the very alignment that makes a model more capable and well-behaved also flattens its output variety. This is why diversity has to be optimized for explicitly rather than assumed. DARLING rewards semantic diversity jointly with quality during RL and finds that the diversity pressure actually *catalyzes* exploration and produces higher-quality answers, not just more varied ones (see "Can diversity optimization improve quality during language model training?").
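To make "rewarding diversity jointly with quality" concrete, here is a minimal sketch of a diversity-aware reward over a group of sampled responses. It is not DARLING's implementation: the notes say DARLING uses a learned classifier for semantic diversity, whereas this sketch substitutes a crude token-overlap (Jaccard) distance, and the multiplicative combination, the function names, and the example quality scores are all assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """Token-overlap distance in [0, 1]; a crude stand-in for semantic distance."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def diversity_scores(responses):
    """Mean distance from each response to every other response in the group."""
    n = len(responses)
    if n < 2:
        return [0.0] * n
    return [
        sum(jaccard_distance(r, other) for j, other in enumerate(responses) if j != i)
        / (n - 1)
        for i, r in enumerate(responses)
    ]


def joint_rewards(responses, quality):
    """Scale each quality score by that response's diversity (assumed combination)."""
    return [q * d for q, d in zip(quality, diversity_scores(responses))]


# Hypothetical group of samples for one prompt, with made-up quality scores.
responses = ["the cat sat", "the cat sat", "a dog ran far"]
quality = [1.0, 1.0, 1.0]
rewards = joint_rewards(responses, quality)
print(rewards)
```

With equal quality scores, the two duplicated responses receive a lower reward than the distinct one, so a policy trained on such a signal is nudged away from collapsing onto a single answer. Whether that pressure also improves quality, as the notes report for DARLING, is an empirical result this sketch does not demonstrate.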

Put together, the corpus reframes the question: more capable models benefit more from diversity elicitation because capability is what stocks the latent distribution you're eliciting from, but only if alignment hasn't already collapsed that distribution into a single safe mode. The interesting open tension is that the two forces driving capability (scale and alignment) push the diversity payoff in opposite directions, which is exactly why the most reliable gains come from training that *targets* diversity rather than prompting that merely hopes to surface it.

Sources (7 notes)

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Can large language models classify argument schemes reliably?

Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.
