LLMs Get Lost In Multi-Turn Conversation
Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
Even though a growing body of work proposes to evaluate LLMs in a multi-turn fashion, we identify in our review (Section 2) that most prior work treats the conversation as episodic: conversation turns might relate to each other, but the conversation can effectively be decomposed into an array of subtasks that can be evaluated in isolation. We argue that episodic tasks move away from what is prevalent in human conversation: underspecification [91, 27].
In this work, we close this gap by creating a simulation environment for multi-turn underspecified conversations – sharded simulation – that leverages existing instructions from high-quality single-turn benchmarks. At a high level, the sharding process we propose transforms an existing single-turn instruction into a sharded instruction: a set of smaller instructions that jointly deliver the same information as the original instruction. Sharded simulation then ensures that at most one shard of information is revealed per conversation turn, enforcing that the instruction is gradually revealed over the course of the conversation.
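To make the setup concrete, the sketch below shows one minimal way to represent a sharded instruction and enforce the one-shard-per-turn constraint. The class and field names are illustrative rather than the exact interfaces used in our codebase.

```python
from dataclasses import dataclass

@dataclass
class ShardedInstruction:
    """An original fully-specified instruction split into smaller shards.

    Shard 0 is assumed to carry the high-level intent; each remaining shard
    adds one clarifying detail (this ordering is an illustrative choice).
    """
    original: str        # the fully-specified single-turn instruction
    shards: list[str]    # smaller instructions that jointly convey the same information
    revealed: int = 0    # how many shards have surfaced in the conversation so far

    def next_shard(self) -> str | None:
        """Reveal at most one shard per conversation turn."""
        if self.revealed >= len(self.shards):
            return None  # nothing left to reveal
        shard = self.shards[self.revealed]
        self.revealed += 1
        return shard

# Illustrative example of a sharded coding instruction.
inst = ShardedInstruction(
    original="Write a Python function that deduplicates a list while preserving order, ignoring case.",
    shards=[
        "I need a Python function that deduplicates a list.",
        "The original order of the elements should be preserved.",
        "Comparison should be case-insensitive.",
    ],
)
```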
On the set of tasks that we experimented on, we observed that models engaged in multi-turn underspecified conversations achieved an average performance of 65%, a 25-point drop from the 90% they achieve in single-turn conversations when they receive the entire instruction at the beginning. Notably, we observe this drop in performance even in two-turn conversations, and across all LLMs we test, from small open-weights models (Llama3.1-8B-Instruct) to state-of-the-art models (Gemini 2.5 Pro).
Furthermore, we decompose the performance degradation into two components: (1) loss in aptitude, and (2) increase in unreliability. We find that in single-turn settings, models with higher aptitude tend to be more reliable (e.g., GPT-4.1, Gemini 2.5 Pro). On the other hand, all LLMs exhibit very high unreliability in multi-turn settings, regardless of aptitude. We refer to this as the lost in conversation phenomenon: when LLMs take a wrong turn in multi-turn conversation, they get lost and do not recover.
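One way to operationalize this decomposition is to simulate each instruction multiple times and summarize the resulting score distribution; the sketch below uses a high percentile of the per-instruction scores as an aptitude estimate and the spread between a high and a low percentile as an unreliability estimate. The specific percentile choices (90th and 10th) are illustrative assumptions here, not a definitive specification of our metrics.

```python
import numpy as np

def aptitude_and_unreliability(scores_per_instruction: list[list[float]]) -> tuple[float, float]:
    """Summarize repeated simulations of the same instructions.

    scores_per_instruction[i] holds the scores of several independent simulated
    conversations for instruction i (e.g., 10 runs scored on a 0-100 scale).
    """
    aptitudes, spreads = [], []
    for scores in scores_per_instruction:
        s = np.asarray(scores, dtype=float)
        best_typical = np.percentile(s, 90)   # what the model achieves on a good run
        worst_typical = np.percentile(s, 10)  # what it achieves on a bad run
        aptitudes.append(best_typical)
        spreads.append(best_typical - worst_typical)
    # Average over instructions: aptitude, and unreliability as run-to-run spread.
    return float(np.mean(aptitudes)), float(np.mean(spreads))
```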
LLMs tend to (1) generate overly verbose responses, leading them to (2) propose final solutions prematurely in conversation, (3) make incorrect assumptions about underspecified details, and (4) rely too heavily on previous (incorrect) answer attempts.
Crucially, such works typically simulate episodic conversations: each turn in the conversation introduces a subtask that relates to previous conversation turns, but can be evaluated in isolation. In this work, we find that episodic tasks overestimate LLM performance in multi-turn conversations (see Section 7.3). In short, although episodic tasks require some level of multi-turn context understanding, they do not involve actively fusing the information to answer underspecified user instructions. Underspecified user instructions are not only common in real-world human-AI communication [27], but also a natural tendency in conversations, termed “the principle of least effort” [91]. We show that underspecification in multi-turn conversations leads to large and universal performance degradations: LLMs make early assumptions to fill in for missing information, prematurely attempt to propose finalized solutions, and have difficulty adapting and course-correcting when provided with new information. We make underspecification the central element of our evaluation setting.
Multi-turn episodic evaluation is sometimes framed as a way to evaluate multi-turn model capabilities with higher granularity. Categories of subtasks (such as refinement, follow-up, expansion, etc.) allow for the study of more specific LLM behavior [2, 37, 74, 19, 16, 48, 25]. Under this framing, multi-turn tasks are treated as distinct from single-turn tasks and are not evaluated on a common set of tasks. We argue that this framing is artificial and limits the scope of multi-turn evaluation, preventing direct comparison of the multi-turn and single-turn abilities of LLMs. In our work, we conduct both single-turn and multi-turn conversation simulations on a common set of tasks: controlled experiments that allow us to precisely identify performance degradations from single- to multi-turn settings.
Evaluating LLMs in multi-turn settings is a challenge because conversational trajectories diverge far more than in single-turn settings. Thus, most previous studies have focused on classification or short-form tasks, which offer more straightforward evaluation. However, the predominant use cases for LLMs are generative in nature, both for programming (e.g., coding assistants) and natural language (e.g., writing, summarizing) [88, 26]. Long-form evaluation in the multi-turn setting is therefore essential, as it assesses models’ ability to flexibly adapt and refine the response as the users provide more information. In this work, we focus exclusively on generation tasks that capture widely used scenarios in both programming and natural language domains.
Scaling multi-turn experimentation requires simulating a user. Existing studies have implemented such user simulation in different ways: relying on templates [12, 68, 39, 16], using an LLM [63, 46, 7, 48], involving human annotators [21, 7], or recruiting real users as part of a study [67, 38, 11]. Although involving real users leads to the most natural and realistic conversations, it comes at the cost of scalability and reproducibility. In this work, we adopt an LLM-based simulator to enable controlled flexibility and divergence. Nevertheless, a fully automated simulation limits the scope of our findings: the conversations we simulate are not representative of human-AI conversations. We therefore frame the simulation as a tool to study LLM behavior in the multi-turn setting rather than user behavior. In addition, as detailed in the Limitations Section (Section 9), we argue that our simulation framework is simplistic and idealized.
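As a rough illustration of how the simulator drives an underspecified conversation, the sketch below alternates between a user-simulator call, which rephrases the next shard as a user utterance, and the assistant under test. `call_llm` is a stand-in for whichever chat-completion API is used, and the prompts shown are placeholders rather than the exact prompts from our setup.

```python
def simulate_sharded_conversation(inst, call_llm, max_turns=10):
    """Simulate one SHARDED conversation: the user simulator reveals at most
    one shard per turn, and the assistant under test responds in between.

    `call_llm(system, messages)` is a placeholder for any chat-completion API
    returning a string; `inst` follows the ShardedInstruction sketch above.
    """
    messages = []
    for _ in range(max_turns):
        shard = inst.next_shard()
        if shard is None:
            break  # all information has been revealed
        # User simulator turns the next shard into a natural-sounding user message.
        user_turn = call_llm(
            system="Rephrase the given piece of the task as a short, casual user message.",
            messages=[{"role": "user", "content": shard}],
        )
        messages.append({"role": "user", "content": user_turn})
        # Assistant under test responds with the full conversation so far as context.
        assistant_turn = call_llm(
            system="You are a helpful assistant.",
            messages=messages,
        )
        messages.append({"role": "assistant", "content": assistant_turn})
    return messages
```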
Figure 4: Conversation simulation types based on sharded instructions. Once an original fully-specified instruction (blue block) is sharded (set of yellow blocks), the “shards” can be used to simulate single-turn (FULL, CONCAT) or multi-turn (SHARDED, RECAP, SNOWBALL) conversations, affecting the pace of information disclosure.

We leverage sharded instructions to simulate five types of single- or multi-turn conversations, as illustrated in Figure 4. We now introduce each one and explain its purpose in our experiments.
FULLY-SPECIFIED (short-form: FULL) simulates single-turn, fully-specified conversations in which the original instruction is provided to the LLM in the first turn. This simulation type evaluates baseline model performance on the tasks.
SHARDED simulates multi-turn, underspecified conversations as outlined above. SHARDED simulations are our primary tool to evaluate model performance in underspecified, multi-turn conversations.
CONCAT simulates single-turn, fully-specified conversations based on the sharded instruction. The shards are concatenated into a single instruction in bullet-point form (with one shard per line), preceded by an instruction to complete the task taking into account all bullet points. The CONCAT simulation is a logical mid-point between FULL and SHARDED, in which underspecification is removed (like FULL) but the rephrasing that occurred during instruction sharding is preserved (like SHARDED). CONCAT is intended as a verification baseline: a model that succeeds at both FULL and CONCAT, but not at SHARDED, struggles specifically because of the underspecified, multi-turn nature of the conversation, and not because of information loss introduced by the rephrasing that occurs during sharding.
RECAP simulates a SHARDED conversation, and adds a final recapitulation turn which restates all the shards of the instruction in a single turn, giving the LLM one final attempt at responding. RECAP is a combination of the SHARDED simulation followed by a CONCAT turn, and is explored as a method in Section 7.1 to evaluate whether such a conceptually simple agent-like intervention can mitigate the loss in performance observed in SHARDED conversations.
SNOWBALL takes the RECAP simulation a step further, implementing turn-level recapitulation. At each turn, the user simulator introduces a new shard, but also restates all the shards that have been revealed so far in the conversation, producing a snowball effect as each turn reveals all the information from the previous turn, plus one additional shard. The redundancy implemented in the SNOWBALL simulation is also explored as a method in Section 7.1 to study whether turn-level reminders help alleviate the need for LLMs to recall information across multiple turns of context.
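The five simulation types differ only in how user turns are assembled from the shards. The sketch below reconstructs that assembly schematically (assistant replies omitted); the preamble and recap wording are illustrative rather than the verbatim templates used in our simulations.

```python
def build_user_turns(original: str, shards: list[str], mode: str) -> list[str]:
    """Sequence of user turns for each simulation type, built from the shards."""
    if mode == "FULL":
        # Single turn containing the original, fully-specified instruction.
        return [original]
    if mode == "CONCAT":
        # Single turn: all shards as a bullet list, preceded by a short preamble.
        bullets = "\n".join(f"- {s}" for s in shards)
        return [f"Complete the task, taking into account all of the bullet points:\n{bullets}"]
    if mode == "SHARDED":
        # One shard per turn, revealed gradually.
        return list(shards)
    if mode == "RECAP":
        # A SHARDED conversation plus a final turn restating every shard.
        recap = "To recap the full request:\n" + "\n".join(f"- {s}" for s in shards)
        return list(shards) + [recap]
    if mode == "SNOWBALL":
        # Each turn restates everything revealed so far plus one new shard.
        return ["\n".join(shards[: i + 1]) for i in range(len(shards))]
    raise ValueError(f"unknown simulation type: {mode}")
```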
Building LLM-based applications typically involves complex processes: decomposition of problems, retrieval of relevant information, use of tools, and calling of actions. Such processes are typically orchestrated by an agentic framework (such as Autogen [84] or LangChain [8]) that allows system builders to compose workflows with LLM calls as individual blocks. As such, an argument could be made that multi-turn capabilities are not a necessary feature of LLMs, as they can be offloaded to the agent framework. In other words, do we need native multi-turn support in LLMs when an agent framework can orchestrate interactions with users and leverage LLMs only as single-turn operators?

To answer this question, we implemented two agent-style conversation simulation types: RECAP and SNOWBALL. Both preprocess user utterances before sending them to the LLM. In RECAP, a conversation proceeds in the same way as SHARDED, but a user turn is added at the end, which recapitulates all the previous user turns. SNOWBALL is a more gradual recapitulation: at each turn, the user simulator reveals a new shard and repeats all shards revealed up to that point. Both simulation types repeat information from past user turns to make it more prominent and give the LLM a chance to leverage the redundancy to improve its responses. We include the experimental details in Appendix M.

Table 2 summarizes the results on all instructions for four tasks (Code, Database, Math, Actions) for two tested models (GPT-4o, GPT-4o-mini). Both RECAP and SNOWBALL demonstrate some level of success, with improvements over SHARDED simulations, but performance still lags behind FULL or CONCAT. Although RECAP outperforms SNOWBALL, we note that RECAP is an unrealistic setting because the intervention is conducted on the last turn of the conversation, which is not known a priori when a conversation unfolds with a real user. SNOWBALL gives a sense of the realistic performance gains achievable through user-turn repetition: it can mitigate the FULL-to-SHARDED performance deterioration by 15-20%. In short, relying on an agent-like framework to process information might be limiting, and we argue LLMs should natively support multi-turn interaction.