Jointly Reinforcing Diversity and Quality in Language Model Generations
Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and narrows the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (Darling), a framework that jointly optimizes for response quality and semantic diversity. At its core, Darling introduces a learned partition function to measure diversity beyond surface-level lexical variation. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that Darling generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, Darling consistently outperforms quality-only RL baselines, producing outputs that are simultaneously higher in quality and novelty. In the second setting, it achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests as higher-quality responses.
To address the loss of diversity during LM post-training, we propose Diversity-Aware Reinforcement Learning (Darling), an online RL objective that (a) measures diversity at the semantic level via a learned classifier, and (b) fuses diversity and quality so that gradient updates are conditioned on “usefully different” trajectories. As illustrated in Figure 1, Darling first partitions the rollouts sampled for a single user prompt into distinct semantic clusters using the learned classifier, capturing diversity beyond superficial lexical differences (§3.1). It then combines the diversity score with the quality reward multiplicatively, amplifying the advantages, and hence the gradient weight on the log-probabilities, of responses that are both high-quality and semantically diverse (§3.2).
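To make the two-step recipe concrete, below is a minimal Python sketch of rollout partitioning and reward fusion. It assumes a pairwise classifier `same_semantics(a, b)` as the learned diversity judge; the greedy clustering, the inverse-cluster-size diversity score, and all function names are illustrative assumptions rather than the exact implementation described in §3.1 and §3.2.

```python
def partition_rollouts(responses, same_semantics):
    """Greedily partition rollouts into semantic clusters.

    `same_semantics(a, b)` stands in for the learned pairwise classifier
    (hypothetical interface): it returns True when two responses express
    the same underlying idea.
    """
    clusters = []  # each cluster holds indices into `responses`
    for i, resp in enumerate(responses):
        for cluster in clusters:
            # Compare against one representative per cluster.
            if same_semantics(responses[cluster[0]], resp):
                cluster.append(i)
                break
        else:  # no existing cluster matched: open a new one
            clusters.append([i])
    return clusters


def darling_rewards(responses, quality_rewards, same_semantics):
    """Fuse quality and diversity multiplicatively.

    The inverse-cluster-size diversity score used here is one illustrative
    choice: responses crowded into a large semantic cluster are discounted,
    while semantically rare responses keep their full quality reward.
    """
    clusters = partition_rollouts(responses, same_semantics)
    fused = [0.0] * len(responses)
    for cluster in clusters:
        diversity = 1.0 / len(cluster)  # rarer semantics -> stronger signal
        for i in cluster:
            fused[i] = quality_rewards[i] * diversity
    return fused


if __name__ == "__main__":
    # Toy check with a trivial stand-in for the classifier: responses that
    # share a first word are treated as semantically equivalent.
    responses = ["sort with quicksort", "sort with mergesort",
                 "use a hash map", "sort via heapsort"]
    quality = [1.0, 0.9, 0.8, 1.0]
    same = lambda a, b: a.split()[0] == b.split()[0]
    print(darling_rewards(responses, quality, same))  # hash-map answer wins
```

In the online RL loop, these fused rewards would stand in for the raw quality rewards before advantage computation, so that gradient mass concentrates on rollouts that are both good and semantically distinct from their siblings.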
We validate Darling’s effectiveness and generalizability across both non-verifiable and verifiable tasks, using multiple language model families and sizes. Experimental results demonstrate that Darling preserves the base model’s diversity while improving benchmark performance on non-verifiable instruction-following and creative-writing tasks as well as on verifiable math problems.
In summary, our contributions are three-fold:
(1) We propose Darling, an RL framework that simultaneously optimizes quality and diversity, preventing diversity collapse during post-training.
(2) We demonstrate that a learned semantic classifier can serve as a scalable and generalizable signal of diversity to integrate into online RL training.
(3) We show that explicitly optimizing for diversity promotes greater exploration, often leading to quality improvements on both non-verifiable (creative writing) and verifiable (competition math) benchmarks.