Evaluating the Diversity and Quality of LLM Generated Content
Recent work suggests that preference-tuning techniques, including Reinforcement Learning from Human Feedback (RLHF) methods such as PPO and GRPO as well as alternatives such as DPO, reduce diversity, which creates a dilemma because these models are widely deployed in applications requiring varied outputs. We argue that diversity without consideration of quality has limited practical value. To address this issue, we introduce a framework for measuring effective semantic diversity, defined as diversity among outputs that meet quality thresholds, which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: under diversity metrics that do not explicitly consider quality, preference-tuned models (particularly those trained via RL) often produce outputs with lower diversity; however, these same preference-tuned models generate greater effective semantic diversity than supervised fine-tuned (SFT) or base models.
The dominant research narrative is that preference tuning harms diversity, yet many open questions persist, including the effect of model size and the ability to generate unique content. We introduce a principled framework for measuring high-quality diversity that requires no human evaluation at inference time, accounts for the fundamental quality-diversity interplay, and enables meaningful comparison across model families and training techniques. When measured by neural cosine diversity alone, the base LLM appears most diverse, despite producing an extremely low proportion of valid (high-quality) generations. Our effective semantic diversity metric penalizes sampling at excessively low or excessively high temperatures, as well as models that struggle to generate coherent content.
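To make the contrast between quality-agnostic diversity and effective semantic diversity concrete, the following is a minimal sketch, not the paper's exact formulation. It assumes each generation has already been reduced to a hashable semantic signature (a hypothetical stand-in, for example a program's outputs on a fixed set of test inputs), with `None` marking a generation that fails the quality threshold. A pair of generations then counts toward diversity only if both are valid and their signatures differ. The function names and toy data are illustrative assumptions.

```python
from itertools import combinations
from typing import Hashable, Optional, Sequence


def pairwise_distinct_fraction(items: Sequence[Hashable]) -> float:
    """Quality-agnostic diversity: share of pairs whose items differ."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def effective_semantic_diversity(
    signatures: Sequence[Optional[Hashable]],
) -> float:
    """Share of pairs that are both valid and semantically distinct.

    Assumption: signatures[i] is None when generation i fails the quality
    threshold; otherwise it is a hashable summary of the generation's meaning.
    """
    pairs = list(combinations(signatures, 2))
    if not pairs:
        return 0.0
    counted = sum(
        a is not None and b is not None and a != b for a, b in pairs
    )
    return counted / len(pairs)


# Toy data, invented for illustration only (not results from the paper).
base_texts = ["x = ", "def f(:", "print(1", "return 42"]  # all different strings
base_signatures = [None, None, None, "s1"]                # only one is valid
tuned_signatures = ["s1", "s1", "s2", "s3"]               # all valid, one repeat

print(pairwise_distinct_fraction(base_texts))           # every pair differs
print(effective_semantic_diversity(base_signatures))    # no pair is fully valid
print(effective_semantic_diversity(tuned_signatures))   # 5 of 6 pairs count
```

In this toy case the mostly invalid sample set looks maximally diverse under the quality-agnostic measure but scores zero once validity is required, which mirrors the qualitative pattern described above for base models.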
We observe a clear trend: all post-training techniques increase both effective semantic diversity and validity relative to base models. RL methods, in particular, yield substantial improvements over SFT in effective semantic diversity. We also find an interesting pattern: preference tuning tends to substantially reduce syntactic and lexical diversity in programming tasks, yet increases these metrics in natural language creative writing. These results suggest that in domains requiring high-quality and diverse outputs, preference-tuned models can outperform both SFT and base models. Furthermore, in creative writing—where diverse word choice and stylistic variety are often desirable—preference-tuned models may hold an advantage in stylistic capabilities.
In code generation, preference tuning—especially reinforcement learning (RL)—is associated with reduced lexical and syntactic diversity but no loss in semantic diversity. For open-ended creative writing, preference tuning is linked to greater diversity in lexical patterns. Finally, when evaluating parameter efficiency for generating unique programs within a fixed sampling budget—for example, when creating unique synthetic data—we find that smaller models, down to around 500 million parameters, are often the most efficient choice. These findings have practical implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
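One possible reading of parameter efficiency under a fixed sampling budget is the number of distinct valid programs obtained per billion model parameters. The sketch below illustrates that reading only; the normalization, the function names, and the signature representation (a hashable semantic summary per sample, with `None` for invalid samples) are assumptions rather than the definition used in the evaluation.

```python
from typing import Hashable, Optional, Sequence


def unique_valid_count(signatures: Sequence[Optional[Hashable]]) -> int:
    """Number of distinct semantic signatures among valid samples."""
    return len({s for s in signatures if s is not None})


def unique_per_billion_params(
    signatures: Sequence[Optional[Hashable]], n_params: float
) -> float:
    """Distinct valid programs per billion parameters (assumed normalization)."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return unique_valid_count(signatures) / (n_params / 1e9)


# Toy budget of six samples, invented for illustration only.
samples = ["a", "b", None, "a", "c", None]

print(unique_valid_count(samples))                 # 3 distinct valid programs
print(unique_per_billion_params(samples, 0.5e9))   # hypothetical 0.5B-parameter model
print(unique_per_billion_params(samples, 7e9))     # hypothetical 7B-parameter model
```

Under this normalization, a larger model must produce proportionally more unique valid programs from the same budget to match a smaller one, which is one way a small model can come out ahead.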
Future Work: Future research must focus on developing robust defenses for automated review systems. We propose three key directions: (1) Sanitization Layers: Developing specialized parsers that detect and neutralize hidden prompts in PDFs before LLM processing. (2) Adversarial Training: Fine-tuning "Judge" models on datasets of adversarial papers to improve their refusal rates against manipulation. (3) Multi-Modal Attacks: Investigating the vulnerability of Vision-Language Models (VLMs) to visual jailbreaks embedded in scientific figures and charts. We facilitate this future work by open-sourcing our entire experimental suite, providing the community with the necessary tools to secure the integrity of the scientific process.