INQUIRING LINE

What happens to model grounding when preference optimization increases effective diversity?

This explores a seeming paradox: some research says preference tuning (RLHF/DPO) *increases* useful diversity, yet other research says the same optimization quietly damages a model's grip on shared reality — so what happens to grounding when the two collide?


The starting point, across the corpus, is that "diversity went up" and "diversity went down" are not contradictory claims; they measure different things. Preference-tuned models look *less* diverse only when diversity is computed over all outputs, because base models spread their variance across incoherent space. Measured among *quality-passing* outputs only, preference tuning actually increases semantic diversity by filtering out the junk (see "Does preference tuning actually reduce the diversity of model outputs?"). Nor is the direction uniform: RLHF compresses lexical-syntactic variety in code while expanding it in creative writing, because code rewards convergence on correct solutions and creative writing rewards stylistic distinctiveness (see "Does preference tuning always reduce diversity the same way?").
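The gap between raw and quality-gated diversity can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the word-level Jaccard distance is a crude stand-in for a real semantic distance, the quality scores and threshold are hypothetical, and the toy samples are invented rather than taken from any cited study.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over lowercase word sets (crude proxy for semantic distance)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def mean_pairwise_distance(texts):
    """Average distance over all unordered pairs; 0.0 if fewer than two texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def effective_diversity(samples, threshold):
    """Diversity among quality-passing outputs only.

    samples: list of (text, quality_score) pairs; the scores are hypothetical.
    """
    passing = [text for text, quality in samples if quality >= threshold]
    return mean_pairwise_distance(passing)


# Toy data (invented): a "base" model with incoherent junk, a "tuned" model without.
base_samples = [
    ("the cat sat", 0.9),
    ("the cat sat down", 0.8),
    ("zx qq vv", 0.1),
    ("lorem blip", 0.05),
]
tuned_samples = [
    ("the cat sat quietly", 0.9),
    ("the dog sat quietly", 0.9),
    ("the cat ran away", 0.8),
]

raw_base = mean_pairwise_distance([t for t, _ in base_samples])
raw_tuned = mean_pairwise_distance([t for t, _ in tuned_samples])
eff_base = effective_diversity(base_samples, threshold=0.5)
eff_tuned = effective_diversity(tuned_samples, threshold=0.5)
```

On this toy data the base model wins on raw diversity and loses on effective diversity, which is the pattern the paragraph describes: the junk samples inflate the ungated number without contributing anything usable.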

Here's the twist the question is reaching for: that "effective diversity" gain is bought by sharpening the policy toward whatever the reward model prefers, and what the reward model prefers is fluent, confident, self-contained answers. That is exactly the behavior that *erodes grounding*. LLMs already produce 77.5% fewer grounding acts than humans (the small moves that establish shared understanding: checking, clarifying, acknowledging), and preference optimization actively widens that gap (see "Does preference optimization damage conversational grounding in large language models?"). So the unsettling answer is that the diversity metric can be climbing while the model's tether to the conversation, the user, and the actual problem is loosening. "More effective diversity" and "less grounding" can be the *same optimization step* viewed from two angles.

The corpus suggests why: the mechanism underneath all of this is probability mass concentration. Outcome-based RL sharpens the policy globally, and the diversity loss even transfers from solved problems onto unsolved ones (see "Does outcome-based RL diversity loss spread across unsolved problems?"). RL amplifies a single dominant pretraining format within the first epoch while collapsing the alternatives (see "Does RL training collapse format diversity in pretrained models?"), and search agents get their exploration squeezed by the same entropy-collapse mechanism documented in reasoning (see "Does reinforcement learning squeeze exploration diversity in search agents?"). Push this across many models and you get the "Artificial Hivemind": 70+ models independently converging on strikingly similar or identical outputs because their training data overlaps and their alignment procedures pull in the same direction (see "Do different AI models actually produce diverse outputs?"). So even a real local gain in effective diversity sits inside a system that is globally collapsing toward a confident center, and a confident center is precisely where grounding goes to die.
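Probability mass concentration can be illustrated with a small sketch. It assumes a toy four-way output distribution and uses power sharpening (raising each probability to an exponent and renormalizing) as a stand-in for the effect of RL; it is not the update rule of any method cited here, and the logits and exponent are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def sharpen(p, beta):
    """Return p_i**beta / sum_j p_j**beta.

    beta > 1 moves mass toward the mode; beta = 1 leaves p unchanged.
    """
    powered = [x ** beta for x in p]
    z = sum(powered)
    return [x / z for x in powered]


# Hypothetical policy over four candidate responses.
policy = softmax([2.0, 1.0, 0.5, 0.0])
sharpened = sharpen(policy, beta=4.0)

entropy_before = entropy(policy)
entropy_after = entropy(sharpened)
```

The sharpened distribution puts more mass on its top response and has lower entropy, so fewer distinct responses are ever sampled. That is the sense in which a policy can become more reliably "good" while exploring less.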

What keeps the two from fighting? The corpus points to interventions that buy diversity *without* paying in grounding or plasticity. DARLING optimizes quality and semantic diversity jointly, and finds that the diversity reward catalyzes exploration and lifts quality rather than trading against it (see "Can diversity optimization improve quality during language model training?"). Critique models inserted into the training loop counteract tail narrowing and preserve solution variety (see "Do critique models improve diversity during training itself?"). And staying close to the base distribution, that is, keeping KL drift low, preserves the model's plasticity for later learning, whereas parameter-only RL stalls when the task domain changes (see "Does staying close to the base model preserve learning ability?"). The thread connecting these is that something *external* to the raw reward (a diversity classifier, a critic, a KL leash) keeps the policy from collapsing onto the single most-rewarded response, which is the condition under which grounding can survive.
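A joint quality-and-diversity reward of the kind described above might look roughly like the following. This is a sketch under stated assumptions, not DARLING's implementation: the multiplicative combination is an illustrative choice, `same_meaning` is a hypothetical stand-in for a learned semantic-equivalence classifier, and the responses and quality scores are invented.

```python
def diversity_scores(responses, same_meaning):
    """For each response, the fraction of the other responses that differ in meaning."""
    n = len(responses)
    if n < 2:
        return [0.0] * n
    scores = []
    for i, response in enumerate(responses):
        distinct = sum(
            1
            for j, other in enumerate(responses)
            if j != i and not same_meaning(response, other)
        )
        scores.append(distinct / (n - 1))
    return scores


def joint_rewards(responses, quality, same_meaning):
    """Scale each quality score by its diversity score (assumed multiplicative form)."""
    diversity = diversity_scores(responses, same_meaning)
    return [q * d for q, d in zip(quality, diversity)]


def same_meaning(a: str, b: str) -> bool:
    """Hypothetical equivalence check: case-insensitive exact match."""
    return a.lower() == b.lower()


# Toy group of sampled responses with invented quality scores.
responses = ["a red kite", "A red kite", "a paper boat"]
quality = [0.9, 0.9, 0.8]
rewards = joint_rewards(responses, quality, same_meaning)
```

Here the two near-duplicate responses split their credit, while the distinct response keeps its full quality score and ends up with the highest reward. The point of the design is that the pressure against collapse comes from a signal outside the quality score itself.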

The thing you didn't know you wanted to know: "effective diversity" is a quality-gated metric, and quality gating and grounding erosion are driven by the *same* fluency-rewarding pressure. So a model can post higher effective-diversity numbers precisely *because* it has gotten better at confidently producing polished, self-assured answers, the very trait that makes it stop checking whether it and the user are still talking about the same thing. Rising diversity scores are therefore not evidence that grounding is intact; under outcome-only objectives they can be a symptom of the collapse (see "Does outcome-based RL diversity loss spread across unsolved problems?").


Sources (10 notes)

Does preference tuning actually reduce the diversity of model outputs?

When diversity is measured among quality-passing outputs rather than all outputs, preference-tuned models generate greater semantic diversity than base models. Base models appear more diverse only because their variance spans incoherent space.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Does preference optimization damage conversational grounding in large language models?

Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
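Drift from the base distribution can be quantified with a KL divergence, sketched below. The note does not say exactly how "closer" is measured, so the direction KL(tuned || base) and the toy next-token distributions are assumptions for illustration only.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support.

    Raises ValueError if q assigns zero probability where p does not,
    because the divergence is infinite in that case.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                raise ValueError("q must be positive wherever p is positive")
            total += pi * math.log(pi / qi)
    return total


# Hypothetical distributions over the same four tokens.
base = [0.4, 0.3, 0.2, 0.1]
low_drift = [0.45, 0.3, 0.17, 0.08]   # stays near the base model
high_drift = [0.94, 0.03, 0.02, 0.01]  # mass collapsed onto one token

drift_low = kl_divergence(low_drift, base)
drift_high = kl_divergence(high_drift, base)
```

The collapsed distribution sits much further from the base than the lightly adjusted one, which is the kind of gap a low-drift training method is meant to avoid.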

Next inquiring lines