Reinforcement Learning for LLMs

Can diversity optimization improve quality during language model training?

Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?

Note · 2026-02-22 · sourced from Reward Models

Post-training of LLMs via RL typically prioritizes accuracy and helpfulness, which sharpens output distributions and reduces the range of ideas. This creates a tension: quality improves while diversity degrades, limiting usefulness for creative and exploratory tasks. The standard assumption is that quality and diversity trade off.

Diversity-Aware Reinforcement Learning (DARLING, 2025) challenges this assumption. It jointly optimizes for quality and semantic diversity during online RL by: (1) using a learned partition function to cluster rollouts into semantically distinct groups (beyond surface-level lexical variation), and (2) multiplying the diversity signal with the quality reward, amplifying the advantage for responses that are both high-quality and semantically novel.
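The multiplicative combination can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the cluster labels are assumed to come from DARLING's learned partition function, and the exact diversity factor here (fraction of rollouts outside a response's cluster) is a simplifying assumption.

```python
from collections import Counter

def darling_rewards(quality_rewards, cluster_ids):
    """Combine per-rollout quality with a semantic-diversity signal.

    quality_rewards: scalar quality score per rollout.
    cluster_ids: semantic cluster label per rollout, assumed to come
    from a learned partition classifier (not implemented here).
    """
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    # Assumed diversity factor: fraction of rollouts *outside* this
    # rollout's semantic cluster, so rarer ideas score higher.
    diversity = [1.0 - counts[c] / n for c in cluster_ids]
    # Multiplying (rather than adding) amplifies the advantage of
    # responses that are both high-quality and semantically novel.
    return [q * d for q, d in zip(quality_rewards, diversity)]
```

With four equally good rollouts where three share one idea and one is novel, the novel rollout's combined reward is three times larger, even though all four have identical quality scores.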

The counter-intuitive finding: explicitly optimizing for diversity also improves quality. On five non-verifiable benchmarks (instruction following and creative writing), DARLING consistently produces outputs of both higher quality and higher novelty than quality-only RL baselines. On verifiable tasks (competition math), it achieves higher pass@1 (solution quality) and pass@k (solution variety).

The mechanism is exploration. As explored in "Does policy entropy collapse limit reasoning performance in RL?", standard RL concentrates probability mass on a narrow set of high-reward trajectories. The diversity reward counteracts this: it forces the model to maintain exploration across semantically distinct solution strategies, which means it encounters more high-quality solutions that pure exploitation would never reach. Diversity is not just an output property — it is a training-time exploration signal.

This has direct implications for the question "Does negative reinforcement alone outperform full reinforcement learning?". If negative reinforcement works by suppression, DARLING works by forced exploration — and the latter may produce broader capability because it explicitly rewards novel correct solutions rather than just penalizing known failures.

The learned semantic classifier is the key architectural innovation. Surface-level lexical diversity (different words) does not capture semantic diversity (different ideas). By training a classifier to recognize genuine conceptual distinctness, DARLING avoids the failure mode where the model produces lexically varied but semantically identical outputs.
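The partitioning step can be sketched as a greedy grouping over rollouts. The `same_meaning` predicate below is a hypothetical stand-in for DARLING's learned equivalence classifier; the point is that lexically varied but semantically identical outputs collapse into one group, so they earn no diversity credit.

```python
def partition_rollouts(rollouts, same_meaning):
    """Greedily partition rollouts into semantically distinct groups.

    same_meaning(a, b) -> bool is a stand-in for a learned semantic
    equivalence classifier (hypothetical interface, not the paper's).
    Returns a list of groups, each a list of rollout indices.
    """
    groups = []
    for i, rollout in enumerate(rollouts):
        for group in groups:
            # Compare against one representative per existing group.
            if same_meaning(rollouts[group[0]], rollout):
                group.append(i)
                break
        else:
            # No existing group matched: this rollout starts a new one.
            groups.append([i])
    return groups
```

For example, with a toy case-insensitive predicate, `partition_rollouts(["Add 2 and 2", "add 2 AND 2", "multiply by 4"], lambda a, b: a.lower() == b.lower())` puts the first two rollouts in one group and the third in its own.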


Source: Reward Models — Jointly Reinforcing Diversity and Quality in Language Model Generations (arxiv 2509.02534)

Explicitly optimizing for semantic diversity during RL catalyzes exploration and simultaneously improves both quality and diversity.