How does diversity collapse during iterative self-improvement cycles?

This explores why the diversity of a model's outputs shrinks when it trains on its own outputs over repeated rounds — and what mechanisms drive that narrowing.

The corpus frames diversity collapse less as a single bug than as a structural pull built into how self-training rewards work: when a system keeps reinforcing what it already does well, probability mass concentrates and the tails of the distribution thin out. The clearest statement is that pure self-improvement is inherently circular. It stalls on the generation-verification gap, diversity collapse, and reward hacking, and the methods that do keep improving quietly smuggle in an external anchor: an old model version, a third-party judge, user corrections, or tool feedback (see: Can models reliably improve themselves without external feedback?). Without one of those outside signals, each cycle feeds on a narrower version of itself.
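The concentration of probability mass can be illustrated with a toy loop. This is a minimal sketch, not a method from the corpus: it assumes the "model" is just a categorical distribution over a handful of outputs, and that the self-reward is the model's own probability for each sample (a stand-in for any verifier that agrees with what the model already prefers). The names `self_train` and `keep_frac` are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def self_train(probs, rounds=10, samples=200, keep_frac=0.5, seed=0):
    """Toy self-training loop: sample, keep the self-preferred half, refit.

    Assumption: the self-reward is the model's own probability for a sample,
    so each round reinforces what the model already favors.
    """
    rng = random.Random(seed)
    k = len(probs)
    history = [entropy(probs)]
    for _ in range(rounds):
        # Generate outputs from the current model.
        draws = rng.choices(range(k), weights=probs, k=samples)
        # Rank by self-reward and keep only the top fraction.
        draws.sort(key=lambda i: probs[i], reverse=True)
        kept = draws[: int(samples * keep_frac)]
        # Refit the model on its own filtered outputs by counting.
        counts = Counter(kept)
        probs = [counts.get(i, 0) / len(kept) for i in range(k)]
        history.append(entropy(probs))
    return probs, history


final_probs, entropy_by_round = self_train([0.3, 0.25, 0.2, 0.15, 0.1])
print(final_probs)
print([round(h, 3) for h in entropy_by_round])
```

In this toy setting an output that drops to zero probability can never be sampled again, which is the tail-thinning described above in its most extreme form. Real models degrade more gradually, but the direction of the pull is the same.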

The sharpest mechanism involves outcome-based reward. When only the correctness of the final answer is rewarded, the policy sharpens globally: it piles probability onto the winning trajectories for problems it already solves, and that sharpening *transfers*, dragging down diversity even on problems it hasn't solved yet (see: Does outcome-based RL diversity loss spread across unsolved problems?). So the collapse isn't confined to the part of the space the model has mastered; it spreads. The same entropy-collapse pattern shows up in search agents, where RL squeezes exploration down to a few reward-maximizing strategies (see: Does reinforcement learning squeeze exploration diversity in search agents?), and even at the level of surface format: within the first epoch, RL tends to amplify one dominant pretraining format and suppress the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (see: Does RL training collapse format diversity in pretrained models?). That last point is unsettling: the collapse can lock in an arbitrary winner.
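One way to picture how sharpening could transfer is a toy policy whose logits are partly shared across problems. This is a hedged sketch under that assumption only; the corpus does not state that parameter sharing is the mechanism, and the names `train_on_solved`, `shared`, and `own` are hypothetical. The update is the exact gradient of expected reward for a softmax policy, applied only on the solved problem, since the unsolved problem yields zero reward for every strategy and therefore no gradient of its own.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def train_on_solved(steps=200, lr=0.5, n_strategies=4):
    """Return the unsolved problem's policy entropy before and after training.

    Assumption: each problem's logits are shared + problem-specific terms.
    """
    shared = [0.0] * n_strategies
    own = {"solved": [0.0] * n_strategies, "unsolved": [0.0] * n_strategies}
    # Outcome reward on the solved problem: only strategy 0 is correct.
    reward = [1.0] + [0.0] * (n_strategies - 1)

    def policy(problem):
        return softmax([s + o for s, o in zip(shared, own[problem])])

    before = entropy(policy("unsolved"))
    for _ in range(steps):
        p = policy("solved")
        expected = sum(pi * ri for pi, ri in zip(p, reward))
        for i in range(n_strategies):
            # Exact gradient of expected reward w.r.t. logit i of a softmax.
            g = p[i] * (reward[i] - expected)
            shared[i] += lr * g          # shared parameters move...
            own["solved"][i] += lr * g   # ...along with the solved problem's own
    after = entropy(policy("unsolved"))
    return before, after


before, after = train_on_solved()
print(round(before, 3), round(after, 3))
```

The unsolved problem never receives a reward signal here, yet its entropy falls because the shared logits moved. How much of this carries over to a real network depends on how much its representations are shared across problems.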

Collapse isn't uniform, though: it depends on what the domain rewards. Preference tuning *reduces* lexical-syntactic diversity in code, where the incentive is to converge on correct solutions, but *increases* it in creative writing, where the incentive is to be stylistically distinctive (see: Does preference tuning always reduce diversity the same way?). So 'diversity collapse' is really the default outcome of convergence-rewarding tasks, not an inevitable law of self-training.

The corpus also has a rich set of counter-mechanisms, which is the most useful part for someone trying to understand the failure. You can fight collapse during training by inserting critique: step-level critique in the training loop counteracts tail-narrowing and keeps solution diversity alive across self-training iterations, a benefit the corpus argues is more fundamental than the test-time accuracy bump (see: Do critique models improve diversity during training itself?). You can reward diversity directly: jointly optimizing for quality and semantic diversity catalyzes exploration and yields *higher* quality than quality-only training (see: Can diversity optimization improve quality during language model training?). Or you can change the shape of the reward itself: keeping rewards as unscalarized vectors (per test-case, per criterion, per persona) builds a natural diversity axis into training, letting solutions specialize along real task trade-offs instead of collapsing to one mode (see: Can reward vectors be the hidden source of solution diversity?). This matters most when the model feeds into a search procedure at inference: an entropy-collapsed policy cannot reach problems that a diversity-trained one solves by exploring and recombining modes (see: Should training maximize diversity when models feed into search?).
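The point about unscalarized rewards can be made concrete with a small selection example. This is an illustrative sketch, not Vector Policy Optimization itself: it only contrasts what survives when per-test-case rewards are summed into one number versus kept as a vector and filtered by Pareto dominance. The candidate names and reward vectors are invented for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return {
        name: vec
        for name, vec in candidates.items()
        if not any(dominates(other, vec) for other in candidates.values())
    }


def scalar_best(candidates):
    """Collapse each reward vector to its sum and keep the single winner."""
    return max(candidates, key=lambda name: sum(candidates[name]))


# Hypothetical per-test-case pass/fail vectors for four candidate solutions.
candidates = {
    "a": (1, 1, 1, 0),  # passes three tests, fails the last
    "b": (0, 0, 0, 1),  # passes only the test that "a" fails
    "c": (1, 0, 0, 0),  # dominated by "a"
    "d": (0, 1, 1, 0),  # dominated by "a"
}

print(scalar_best(candidates))
print(sorted(pareto_front(candidates)))
```

Scalarizing keeps only the highest-sum candidate, while the vector view also retains the low-scoring candidate that covers the one case the winner misses. That retained candidate is the kind of mode a downstream search procedure could recombine with the winner.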

The through-line a curious reader might not expect: diversity collapse and the limits of self-improvement are the same phenomenon viewed from two angles. A model improving on its own outputs converges, and convergence kills the exploration that future improvement depends on. The system therefore runs out of road unless something from outside keeps reintroducing the variation the cycle keeps eating: a critic, an external judge, a vector-structured reward, or genuinely adaptive metacognition rather than a human-fixed loop (see: Can AI systems improve their own learning strategies?).

Sources (10 notes)

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Can reward vectors be the hidden source of solution diversity?

Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.

Should training maximize diversity when models feed into search?

Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.

Can AI systems improve their own learning strategies?

Current self-improvement methods use extrinsic, fixed metacognitive loops designed by humans that fail under domain shift or capability changes. True self-improvement requires agents to generate their own adaptive metacognitive knowledge, planning, and evaluation—a gap confirmed as a neglected research area across neuro-symbolic AI.
