MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic in current human-LLM interactions, but are also challenging to all current frontier LLMs. All four challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop an LLM-as-judge evaluation with instance-level rubrics, enabling automatic grading with fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models score below 50% accuracy on MultiChallenge, with the top performer, Claude 3.5 Sonnet (June 2024), achieving just 41.4% average accuracy.
Each test example in MultiChallenge is a conversation history of at most 10 turns between two parties, ending with a final user turn that contains a requirement or question. LLMs must respond properly to this final user turn given the multi-turn conversation history. We identify four categories of challenges in multi-turn conversations that are not only common and realistic in current human-LLM interactions, but also difficult for current frontier LLMs: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence.
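To make the setup concrete, the sketch below shows one plausible way a MultiChallenge-style test example could be represented and turned into the messages an evaluated model sees. The field names (challenge, conversation, final_user_turn, rubric) and the example content are illustrative assumptions, not the benchmark's actual schema.

```python
# Illustrative sketch only: the fields and example content are assumptions
# about how a MultiChallenge-style test case could be laid out.
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class TestExample:
    challenge: str                        # one of the four challenge categories
    conversation: List[Dict[str, str]]    # up to 10 prior turns as role/content pairs
    final_user_turn: str                  # the turn the evaluated model must answer
    rubric: str                           # instance-level pass/fail criterion for the judge

def build_messages(example: TestExample) -> List[Dict[str, str]]:
    """Assemble the chat messages for the evaluated model:
    the prior conversation history followed by the final user turn."""
    return example.conversation + [{"role": "user", "content": example.final_user_turn}]

# Hypothetical instruction-retention case: the first user turn sets a constraint
# that must still hold when the model answers the final turn.
example = TestExample(
    challenge="instruction_retention",
    conversation=[
        {"role": "user", "content": "For this whole chat, answer in exactly two sentences."},
        {"role": "assistant", "content": "Understood. I will keep every answer to two sentences."},
        {"role": "user", "content": "What is a good beginner programming language?"},
        {"role": "assistant", "content": "Python is a common choice. Its syntax is simple and readable."},
    ],
    final_user_turn="Now explain how version control helps a team of developers.",
    rubric="The response explains version control AND contains exactly two sentences.",
)

messages = build_messages(example)  # pass these to any chat-completion API
```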
Instruction retention evaluates whether LLMs can follow instructions specified in the first user turn throughout the entire multi-turn conversation. Inference memory of user information evaluates LLMs on recalling and connecting relevant details scattered across earlier user turns when those details are implicitly needed to answer the final user turn. Reliable versioned editing evaluates whether LLMs can properly help human users revise existing materials through back-and-forth iterations. Finally, self-coherence evaluates whether LLMs can remain reasonably consistent with their own earlier responses in the conversation and avoid sycophancy (unconditionally agreeing with human users).
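The automatic evaluation described above relies on an LLM judge applying an instance-level rubric to each response. Below is a minimal sketch of how such a judge could be wired up; the prompt wording, the single-word YES/NO verdict format, and the generic judge_fn wrapper are assumptions for illustration rather than the benchmark's actual judge implementation.

```python
# Minimal LLM-as-judge sketch with an instance-level rubric.
# The prompt template and verdict parsing are illustrative assumptions.
from typing import Callable

JUDGE_TEMPLATE = """You are grading a model response against a rubric.

Instance-level rubric:
{rubric}

Model response to the final user turn:
{response}

Does the response satisfy the rubric? Answer with a single word: YES or NO."""

def judge_response(rubric: str, response: str,
                   judge_fn: Callable[[str], str]) -> bool:
    """Ask a judge model (wrapped by judge_fn) whether the response
    satisfies this instance's rubric; return True on a YES verdict."""
    prompt = JUDGE_TEMPLATE.format(rubric=rubric, response=response)
    verdict = judge_fn(prompt).strip().upper()
    return verdict.startswith("YES")

# Example with a stub judge; in practice judge_fn would call a frontier LLM.
stub = lambda prompt: "YES"
passed = judge_response(
    "The response explains version control AND contains exactly two sentences.",
    "Version control tracks every change to the code. It lets teams merge work safely.",
    stub,
)
print(passed)
```

Under this kind of setup, a model's benchmark accuracy would simply be the fraction of test examples whose responses pass their instance-level rubrics.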