Can diversity optimization improve quality during language model training?
Standard RL training treats quality and diversity as a trade-off, with diversity optimization expected to hurt performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
Post-training of LLMs via RL typically prioritizes accuracy and helpfulness, which sharpens output distributions and reduces the range of ideas. This creates a tension: quality improves while diversity degrades, limiting usefulness for creative and exploratory tasks. The standard assumption is that quality and diversity trade off.
Diversity-Aware Reinforcement Learning (DARLING, 2025) challenges this assumption. It jointly optimizes for quality and semantic diversity during online RL by: (1) using a learned partition function to cluster rollouts into semantically distinct groups (beyond surface-level lexical variation), and (2) multiplying the diversity signal with the quality reward, amplifying the advantage for responses that are both high-quality and semantically novel.
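As a rough illustration of step (2), here is a minimal sketch of the multiplicative reward shaping. The batch-level diversity bonus (one minus the fraction of rollouts sharing a semantic cluster) is an assumption for illustration; the paper's exact diversity signal and partition function differ, and `darling_rewards` is a hypothetical helper, not the paper's code.

```python
import numpy as np

def darling_rewards(quality_rewards: np.ndarray, cluster_ids: np.ndarray) -> np.ndarray:
    """Multiply each rollout's quality reward by a diversity bonus.

    Rollouts in rarer semantic clusters get a larger bonus, so responses
    that are both high-quality and semantically novel receive an
    amplified advantage during online RL. (Illustrative bonus only.)
    """
    n = len(cluster_ids)
    # Fraction of the batch sharing each rollout's semantic cluster.
    _, inverse, counts = np.unique(cluster_ids, return_inverse=True, return_counts=True)
    cluster_frac = counts[inverse] / n
    # Diversity bonus: 1.0 for a unique rollout, smaller for redundant ones.
    diversity = 1.0 - cluster_frac + 1.0 / n
    return quality_rewards * diversity

# Example: four rollouts, the first two share a semantic cluster.
quality = np.array([0.9, 0.9, 0.7, 0.2])
clusters = np.array([0, 0, 1, 2])
print(darling_rewards(quality, clusters))  # redundant pair is discounted
```

Because the combination is multiplicative rather than additive, a semantically novel but low-quality response gains little: novelty only amplifies reward that is already there.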
The counter-intuitive finding: explicitly optimizing for diversity also improves quality. On five non-verifiable benchmarks (instruction following and creative writing), DARLING consistently produces outputs of both higher quality and higher novelty than quality-only RL baselines. On verifiable tasks (competition math), it achieves higher pass@1 (solution quality) and pass@k (solution variety).
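For context, pass@k here refers to the standard unbiased estimator (Chen et al., 2021), which is not specific to DARLING: given n sampled solutions of which c are correct, it estimates the probability that at least one of k draws is correct. A diversity-optimized policy raises pass@k by covering distinct solution strategies instead of resampling one.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: P(at least one of k samples is correct),
    given n generated solutions of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct solution
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples, 4 correct.
print(pass_at_k(16, 4, 1))  # pass@1 = 0.25
print(pass_at_k(16, 4, 8))  # pass@8 ≈ 0.96
```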
The mechanism is exploration. As described in "Does policy entropy collapse limit reasoning performance in RL?", standard RL concentrates probability mass on a narrow set of high-reward trajectories. The diversity reward counteracts this: it forces the model to maintain exploration across semantically distinct solution strategies, so it encounters high-quality solutions that pure exploitation would never reach. Diversity is not just an output property; it is a training-time exploration signal.
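One way to see "concentration of probability mass" quantitatively is to track the mean per-token entropy of the policy over rollouts during training; a steady slide toward zero is the collapse signature. This diagnostic is an illustration of the concept, not part of DARLING.

```python
import numpy as np

def mean_token_entropy(token_logprobs: list[np.ndarray]) -> float:
    """Average Shannon entropy (in nats) of the policy's next-token
    distribution, given one full log-probability vector per generated
    token. Values trending toward zero indicate entropy collapse."""
    entropies = [-(np.exp(lp) * lp).sum() for lp in token_logprobs]
    return float(np.mean(entropies))
```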
This has direct implications for the question "Does negative reinforcement alone outperform full reinforcement learning?". If negative reinforcement works by suppression, DARLING works by forced exploration, and the latter may produce broader capability because it explicitly rewards novel correct solutions rather than only penalizing known failures.
The learned semantic classifier is the key architectural innovation. Surface-level lexical diversity (different words) does not capture semantic diversity (different ideas). By training a classifier to recognize genuine conceptual distinctness, DARLING avoids the failure mode where the model produces lexically varied but semantically identical outputs.
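Here is a sketch of what that partition step could look like, assuming a pairwise judge `same_idea(a, b) -> bool` as a stand-in for the trained classifier (not the paper's implementation): rollouts judged semantically equivalent are merged with union-find, so lexical rewording alone cannot create a new cluster.

```python
from itertools import combinations

def partition_rollouts(rollouts: list[str], same_idea) -> list[int]:
    """Group rollouts into semantic clusters using a pairwise judge.
    Returns one cluster id per rollout."""
    parent = list(range(len(rollouts)))

    def find(i: int) -> int:
        # Union-find root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair the judge considers semantically equivalent.
    for i, j in combinations(range(len(rollouts)), 2):
        if same_idea(rollouts[i], rollouts[j]):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    return [find(i) for i in range(len(rollouts))]

# Toy judge that treats answers ending in the same token as the same
# idea (purely illustrative; DARLING trains a classifier instead):
rollouts = ["the answer is 4", "it equals 4", "the answer is 5"]
same = lambda a, b: a.split()[-1] == b.split()[-1]
print(partition_rollouts(rollouts, same))  # -> [0, 0, 2]
```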
Source: Jointly Reinforcing Diversity and Quality in Language Model Generations (arXiv:2509.02534)
Related concepts in this collection
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  Connection: DARLING directly addresses entropy collapse via its diversity reward; the mechanism is forced exploration.
- Does negative reinforcement alone outperform full reinforcement learning?
  Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.
  Connection: a complementary mechanism, suppression versus forced exploration.
- Why do LLMs generate novel ideas from narrow ranges?
  LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
  Connection: DARLING's approach could address research ideation collapse by optimizing for semantic diversity during generation.
- Why do reasoning models fail differently at training versus inference?
  Reasoning models exhibit two distinct failure modes, entropy collapse during training and variance inflation during inference, that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
  Connection: DARLING addresses the training-time side by maintaining exploration diversity.
Original note title: explicitly optimizing for semantic diversity during RL catalyzes exploration and simultaneously improves both quality and diversity