NoveltyBench: Evaluating Language Models for Humanlike Diversity
Language models have demonstrated remarkable capabilities on standard benchmarks, yet they increasingly suffer from mode collapse, the inability to generate diverse and novel outputs. Our work introduces NOVELTYBENCH, a benchmark specifically designed to evaluate the ability of language models to produce multiple distinct and high-quality outputs. NOVELTYBENCH uses prompts curated to elicit diverse answers, along with filtered real-world user queries. Evaluating 20 leading language models, we find that current state-of-the-art systems generate significantly less diversity than human writers. Notably, larger models within a family often exhibit less diversity than their smaller counterparts, challenging the notion that capability on standard benchmarks translates directly to generative utility. While prompting strategies like in-context regeneration can elicit diversity, our findings point to a fundamental lack of distributional diversity in current models, which reduces their utility for users seeking varied responses and suggests the need for new training and evaluation paradigms that prioritize diversity alongside quality.
Diversity of opinion, preference, and experience is a fundamental trait of being human. If you were to ask “Tell me a joke” or “What is the best book of all time?” to the next five people you talk to, you would most likely receive five different answers. It is reasonable to expect language models to generate responses with the same level of diversity as humans. Yet, when we ask models such as GPT-4 to recommend a movie or Claude 3 to suggest several vacation destinations, we often receive variations of the same few ideas, a phenomenon known as mode collapse (Hamilton, 2024).
This lack of diversity in language model outputs represents a significant limitation. Today’s aligned language models tend to produce lower-entropy distributions than earlier generations of models (Zhang et al., 2024b), and when asked to generate (using random sampling) several responses to an open-ended prompt, they often produce substantial near-duplicates (O’Mahony et al., 2024). This tendency can harm the utility of these models for subjective tasks where different users may have diverging preferences and needs (Zhang et al., 2024a). Sorensen et al. (2024) refer to the inability of today’s LLMs to produce diverse generations as a failure of pluralistic alignment, which can lead to less useful and customizable AI systems.
While today’s LLMs are evaluated at great length for knowledge and reasoning abilities, they are rarely evaluated for response diversity, and there are no widely adopted benchmarks for assessing this trait. The majority of existing evaluation benchmarks are “mode-seeking”: they assess only the quality of a model’s most likely generation and do not assess its capability to produce meaningful alternatives.
This evaluation paradigm is problematic because it creates misaligned incentives: model developers focus on improving the quality of the single most likely generation rather than the diversity of the entire distribution of possible outputs. In this work, we set out to measure not only what language models can generate, but also what they cannot. To this end, we propose NOVELTYBENCH (Figure 1), a benchmark for measuring how well language models can generate multiple differing but still correct answers to user requests that involve subjectivity, randomness, and creativity—in other words, queries to which a room of humans would give a wide variety of answers. We intend for our benchmark to serve as a complement to existing quality-based evaluation benchmarks, encouraging model developers to strive toward more pluralistic models while still taking generation quality into account.