NoveltyBench

Evaluating Language Models for Humanlike Diversity

Yiming Zhang, Harshita Diddee, Susan Holm, Hanchen Liu, Xinyue Liu, Vinay Samuel, Barry Wang, Daphne Ippolito

Leaderboard

distinct_k: the average number of distinct generations out of k generations.

utility_k: the cumulative utility out of k generations.

Model family ↕	Variant ↕	Mean distinct₁₀ ↕	Mean utility₁₀ ↕	Date
Anthropic	Claude-3.5 Haiku	1.94	2.50	2025-03-27
Anthropic	Claude-3.5 Sonnet	1.76	2.36	2025-03-27
Anthropic	Claude-3 Opus	2.04	2.67	2025-03-27
OpenAI	gpt-4o-mini	2.65	3.11	2025-03-27
OpenAI	gpt-4o	2.88	3.27	2025-03-27
Gemini	gemini-1.5-pro	1.85	2.73	2025-03-27
Gemini	gemini-2.0-flash-lite	2.83	3.20	2025-03-27
Gemini	gemini-2.0-flash	2.81	3.17	2025-03-27
Gemini	gemini-2.0-pro	2.25	2.64	2025-03-27
Cohere	command-r7b	3.58	3.35	2025-03-27
Cohere	command-r	2.68	2.98	2025-03-27
Cohere	command-r-plus	2.79	3.08	2025-03-27
Gemma 2	gemma-2-2b-it	5.66	4.63	2025-03-27
Gemma 2	gemma-2-9b-it	3.25	3.93	2025-03-27
Gemma 2	gemma-2-27b-it	3.03	3.77	2025-03-27
Llama 3	Llama-3.2-1B	6.74	2.81	2025-03-27
Llama 3	Llama-3.2-3B	5.10	3.24	2025-03-27
Llama 3	Llama-3.1-8B	5.24	3.76	2025-03-27
Llama 3	Llama-3.3-70B	2.49	2.87	2025-03-27
Llama 3	Llama-3.1-405B	3.20	3.39	2025-03-27

What is NoveltyBench?

NoveltyBench is a benchmark designed to evaluate language models' ability to generate simultaneously distinct and high-quality outputs. We set out to evaluate language models not only by what they can generate, but also by what they cannot generate. Specifically, unlike conventional benchmarks that assess only the quality of a single "best" generation, we measure both diversity and quality over the output distribution.

Why NoveltyBench?

It is well-known that today's language models suffer from mode collapse, the inability to produce a variety of outputs even when diversity is expected. Ask Claude or GPT-4 for vacation recommendations multiple times, and you'll often get variations of the same few destinations, unlike asking different humans who would suggest a wide range of options.

This lack of diversity matters because different users have different needs and preferences. A single "best" answer rarely exists for subjective tasks like recommending books, suggesting creative solutions to puzzles, or generating stories with unique twists. When models consistently produce similar outputs, they can't effectively serve the full spectrum of human preferences and may reinforce existing biases by suggesting certain answers are universally "correct."

Notably, the majority of existing evaluation benchmarks are "mode-seeking", which evaluate models based on their ability to generate exactly one correct or high-quality response. This creates misaligned incentives for model developers, who focus on improving the quality of the single most likely output rather than ensuring diversity across possible responses.

How does it work?

NoveltyBench consists of 1,100 prompts designed to elicit diverse responses, including NB-Curated (100 manually crafted prompts that span four categories: randomness, factual knowledge, creative writing, and subjectivity) and NB-WildChat (1,000 prompts collected from real user interactions with ChatGPT). We introduce two metrics to address the limitations of traditional diversity measures:

distinct_k, which counts the number of "meaningfully" distinct responses a model generates in k samples.
utility_k, which unifies generation novelty and quality by modeling the cumulative utility for a user who only benefits from an additional generation when it is distinct from the previous generations.

Key Findings

Our evaluation of 20 leading language models reveals four key findings:

State-of-the-art models generate significantly less diversity than humans, producing on average fewer than 4 distinct responses in 10 samples.
Novelty seems to scale inversely with model size, with smaller models often demonstrating greater diversity than their larger counterparts within the same family. Utility of larger models degrades more quickly when users demand diverse outputs due to their propensity for mode collapse.
While prompt engineering techniques like in-context regeneration can partially improve diversity, our findings nevertheless reveal a fundamental lack in distributional diversity in current models, suggesting the need for new training paradigms that prioritize diversity alongside quality.

Acknowledgements

This website is based on the SWE-bench leaderboard, used with the permission of the SWE-bench team. We thank them for their work in creating and sharing the template.