INQUIRING LINE

Why does positive reinforcement degrade diversity at higher k values?

This explores why training a model on its own correct answers (positive reinforcement) hurts the diversity of solutions you can sample, as measured by Pass@k: draw k attempts and count the problem as solved if any one of them succeeds.


Rewarding a model for its correct answers quietly erodes performance when you sample many attempts rather than one. The cleanest answer in the corpus is mechanical: positive-only reinforcement works by concentrating probability mass onto the trajectories that already succeed. At k=1 that looks like an improvement, because the single most-likely output is now more reliable. But Pass@k at higher k depends on the model still being *able* to produce many different correct paths. Once probability has been pulled onto a few winning trajectories, the long tail of alternative-but-valid solutions thins out, so drawing more samples stops buying you new ways to succeed. The note "Does negative reinforcement alone outperform full reinforcement learning?" makes the contrast sharp: training on *only* the wrong answers, which pushes probability away from failures rather than toward winners, often matches full RL (PPO and GRPO) on Pass@k, precisely because it suppresses bad trajectories without collapsing the spread of good ones.
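A toy calculation makes the crossover concrete. The sketch below assumes each sample succeeds independently with a fixed per-problem probability p, so Pass@k for one problem is 1 - (1 - p)^k. The two probability lists are invented for illustration and are not taken from any of the cited notes.

```python
def pass_at_k(success_probs, k):
    """Mean over problems of P(at least one of k independent samples succeeds)."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem success probabilities for ten problems.
diverse = [0.3] * 10               # every problem is reachable, none is reliable
sharpened = [1.0] * 4 + [0.0] * 6  # four problems locked in, correct tail gone on six

for k in (1, 4, 16, 64):
    print(f"k={k:>2}  diverse={pass_at_k(diverse, k):.3f}  "
          f"sharpened={pass_at_k(sharpened, k):.3f}")
```

Under these assumed numbers the sharpened policy wins at k=1 (0.4 versus 0.3), but it stays at 0.4 for every k, while the diverse policy climbs toward 1.0 as k grows.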

The interesting twist is that this isn't a local effect on the problems you trained on. The note "Does outcome-based RL diversity loss spread across unsolved problems?" shows the sharpening is global: rewarding final-answer correctness concentrates the policy everywhere, so diversity also drains away on problems the model never solved and never received a reward signal for. The model becomes more confident in general, including in places where confidence is exactly what you don't want.

This is the same phenomenon researchers call entropy collapse, and it shows up far outside math reasoning. The note "Does reinforcement learning squeeze exploration diversity in search agents?" documents the identical squeeze in search agents, where policies converge on narrow reward-maximizing strategies, and notes that supervised fine-tuning on diverse demonstrations preserves the breadth that RL destroys. So the degradation isn't a quirk of one task; it's what maximizing a scalar reward tends to do by construction.
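Entropy collapse can be illustrated numerically. The sketch below uses a made-up distribution over four solution strategies and a power sharpening step (raise each probability to the power beta, then renormalize) as a stand-in for what reward maximization does to a policy. It is an assumption chosen for simplicity, not the update rule of any particular RL algorithm.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sharpen(probs, beta):
    """Raise each probability to the power beta and renormalize (beta > 1 sharpens)."""
    powered = [p ** beta for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical probabilities of four distinct solution strategies.
policy = [0.4, 0.3, 0.2, 0.1]

for beta in (1, 2, 4, 8):
    q = sharpen(policy, beta)
    print(f"beta={beta}  entropy={entropy(q):.3f}  probs={[round(x, 3) for x in q]}")
```

Entropy falls as beta grows, and the least likely strategy loses the largest share of its mass: the tail thins first.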

Why higher k specifically gets hit hardest connects to a deeper point: when you plan to sample many times or run search at inference, the *right* training objective changes. The note "Should training maximize diversity when models feed into search?" argues that a model feeding into evolutionary or repeated-sampling procedures should be trained to emit many competent-but-different solutions, because an entropy-collapsed policy cannot reach problems that require combining modes. Positive reinforcement optimizes the wrong thing for that regime: it maximizes the best single guess while quietly destroying the variety that high-k sampling exists to exploit.

The corpus also points to fixes that recover diversity without giving up quality, which is the part you might not know you wanted. The note "Can reward vectors be the hidden source of solution diversity?" keeps rewards unscalarized, decomposed per test case or criterion, so solutions specialize along real trade-offs instead of collapsing to one. "Can diversity optimization improve quality during language model training?" adds a semantic-diversity reward and finds it *catalyzes* exploration, producing higher quality than quality-only training. And "Do critique models improve diversity during training itself?" shows that step-level critique inside the training loop counteracts the tail-narrowing directly. One caveat worth holding: "Does preference tuning always reduce diversity the same way?" finds the direction isn't universal. RLHF compresses diversity where the domain rewards convergence (code) but can expand it where the domain rewards distinctiveness (creative writing). The degradation at high k is what happens when your reward says 'one right answer,' which is most of reasoning, but not all of everything.
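The unscalarized-reward idea above can be illustrated with a small Pareto-front computation. This is a minimal sketch with invented candidates and pass/fail results per test case. It is not the Vector Policy Optimization algorithm itself, only a picture of why a reward vector retains more than one "best" solution while a scalar sum retains exactly one.

```python
def dominates(a, b):
    """True if a is at least as good as b on every criterion and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep every candidate whose reward vector no other candidate dominates."""
    return {
        name: reward
        for name, reward in candidates.items()
        if not any(dominates(other, reward) for other in candidates.values())
    }


# Hypothetical solutions with pass (1) / fail (0) on four test cases.
candidates = {
    "A": (1, 1, 0, 0),
    "B": (0, 0, 1, 1),
    "C": (1, 0, 1, 1),
    "D": (1, 0, 0, 0),
}

scalar_best = max(candidates, key=lambda name: sum(candidates[name]))
print("scalar winner:", scalar_best)
print("Pareto front:", sorted(pareto_front(candidates)))
```

Summing the rewards keeps only C, while the Pareto front also keeps A, the one candidate that passes the second test case. That kind of complementary solution is what a later search or recombination step can exploit.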


Sources (8 notes)

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the full range of k values, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Should training maximize diversity when models feed into search?

Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.

Can reward vectors be the hidden source of solution diversity?

Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
