Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation

Paper · arXiv 2602.07338 · Published February 7, 2026
Topics: Conversation · Dialog · Conversation Agents · Assistants · Personalization

Multi-turn conversation has emerged as a predominant interaction paradigm for Large Language Models (LLMs). Users often employ follow-up questions to refine their intent, expecting LLMs to adapt dynamically. However, recent research (Laban et al., 2025) reveals that LLMs suffer a substantial performance drop in multi-turn settings compared to single-turn interactions with fully specified instructions, a phenomenon termed “Lost in Conversation” (LiC). While this prior work attributes LiC to model unreliability, we argue that the root cause lies in an intent alignment gap rather than intrinsic capability deficits. In this paper, we first demonstrate that LiC is not a failure of model capability but rather a breakdown in interaction between users and LLMs. We theoretically show that scaling model size or improving training alone cannot resolve this gap, as it arises from structural ambiguity in conversational context rather than representational limitations. To address this, we propose to decouple intent understanding from task execution through a Mediator-Assistant architecture. By utilizing an experience-driven Mediator to explicate user inputs into explicit, well-structured instructions based on historical interaction patterns, our approach effectively bridges the gap between vague user intent and model interpretation. Experimental results demonstrate that this method significantly mitigates performance degradation in multi-turn conversations across diverse LLMs.

In contemporary AI-assisted applications, multi-turn dialogue has become the primary mode of interaction between users and large language models (LLMs). Thanks to their massive parameter scales and extensive pretraining on diverse corpora, modern LLMs now exhibit impressive capabilities in language understanding, reasoning, and task execution, and they often perform remarkably well when given clear, complete, and well-structured instructions in a single turn. However, real-world user behavior rarely conforms to this idealized setting.

In practice, users frequently start with vague, underspecified, or even internally inconsistent goals, and only gradually clarify and refine their true needs through an iterative conversational process with the model (Zamfirescu-Pereira et al., 2023; Min et al., 2020). This incremental, exploratory nature of human problem formulation poses substantially greater challenges for LLMs than standard single-turn benchmarks: the model must not only understand and solve the current subtask, but also continually infer, update, and realign with a moving target of user intent across turns.

Recent research (Laban et al., 2025) presents a set of controlled experiments designed to simulate the instruction underspecification that frequently occurs in human conversation (Herlihy et al., 2024; Zipf, 1949). The study systematically compares performance under “single-turn, fully specified” (Full) versus “multi-turn, underspecified” (Sharded) interactions, revealing a substantial performance degradation of approximately 30% for all evaluated LLMs. The authors argue that under incomplete information, LLMs tend to make premature assumptions early in the dialogue and subsequently “lock in” these assumptions, causing the final responses to drift away from the user’s true intent. They term this phenomenon “Lost in Conversation” (LiC) and primarily attribute it to the reduced reliability of LLMs in multi-turn dialogue.

On this basis, they advocate that LLMs should natively support multi-turn interaction and that model builders should jointly optimize models’ aptitude and reliability in iterative conversational settings.
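To make the Full/Sharded contrast concrete, the minimal sketch below simulates both settings against a generic chat backend. The `run_full`/`run_sharded` helpers, the `LLM` callable, and the shard wording are illustrative assumptions, not the benchmark's actual shards or evaluation protocol.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
LLM = Callable[[List[Message]], str]  # chat history in, assistant reply out


def run_full(llm: LLM, full_instruction: str) -> str:
    """Single-turn setting: the complete, fully specified instruction at once."""
    return llm([{"role": "user", "content": full_instruction}])


def run_sharded(llm: LLM, shards: List[str]) -> str:
    """Multi-turn setting: the same information revealed one shard per turn.

    Only the reply to the final shard is returned, mirroring the idea that
    the last answer is what gets evaluated.
    """
    history: List[Message] = []
    reply = ""
    for shard in shards:
        history.append({"role": "user", "content": shard})
        reply = llm(history)
        history.append({"role": "assistant", "content": reply})
    return reply


# Illustrative (hypothetical) decomposition of one instruction into shards.
full = ("Write a Python function that returns the n-th Fibonacci number, "
        "uses iteration rather than recursion, and raises ValueError for n < 0.")
shards = [
    "Write a function for Fibonacci numbers.",
    "It should be in Python and take n as input.",
    "Use iteration, not recursion.",
    "Oh, and raise ValueError if n is negative.",
]
```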

In this work, we revisit this phenomenon and offer a different explanatory perspective. We argue that: (1) Making early assumptions and providing tentative answers is not simply erroneous behavior, but a rational strategy induced by the dominant training objective of being helpful (Ouyang et al., 2022) and the penalty often associated with evasive responses in RLHF pipelines. Under conditions of incomplete information, the model is inclined to construct a plausible task formulation for a typical user and produce a provisional answer based on that formulation, instead of repeatedly refusing to answer or endlessly requesting additional information. (2) The primary bottleneck in failed multi-turn conversations is not a lack of model capacity or reasoning depth, but a pragmatic mismatch between user expression and model interpretation (Figure 1 left). Users exhibit systematic individual variation, where the same utterance may map to disparate underlying intentions. General-purpose LLMs, aligned to the “average” user, fail to adapt to these idiosyncratic behaviors. For instance, models frequently misinterpret a user’s fragmentary continuation as a confirmation of previous assumptions rather than a correction, thereby reinforcing an incorrect context.

To address this, we propose a framework that fundamentally decouples intent understanding from task execution. We operationalize this through a Mediator-Assistant pipeline, where a Mediator explicates user inputs to explicitly articulate latent requirements before they reach the execution Assistant. To align with specific user pragmatics, we employ an LLM-based Refiner to automatically distill explicit guidelines by analyzing the discrepancies between failed and successful interaction trajectories. These guidelines then serve as context for the Mediator, enabling the system to bridge the alignment gap and adapt to individual user behaviors without the need for weight updates. Our approach directly addresses the root cause of LiC: the misalignment between how users express intent and how models interpret it (Figure 1 right). By bridging this gap through adaptive input rewriting, we demonstrate substantial recovery of multi-turn performance across diverse LLMs, highlighting the critical role of user-aware intent modeling in conversational AI.
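A minimal sketch of this decoupling, assuming a generic chat backend: the `mediate` and `assistant_turn` helpers and the prompt wording are illustrative, not the exact prompts used by our Mediator.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
LLM = Callable[[List[Message]], str]  # chat history in, reply out

MEDIATOR_SYSTEM = (
    "You rewrite the user's latest message into a single explicit, "
    "self-contained instruction, using the conversation history and the "
    "user-specific guidelines. Do not solve the task yourself."
)


def mediate(mediator: LLM, history: List[Message],
            user_turn: str, guidelines: List[str]) -> str:
    """Explicate a vague or fragmentary user turn into an explicit instruction."""
    context = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    prompt = (
        "Guidelines about this user:\n- " + "\n- ".join(guidelines or ["(none yet)"]) +
        "\n\nConversation so far:\n" + context +
        "\n\nLatest user message: " + user_turn +
        "\n\nRewrite the latest message as one explicit instruction:"
    )
    return mediator([{"role": "system", "content": MEDIATOR_SYSTEM},
                     {"role": "user", "content": prompt}])


def assistant_turn(assistant: LLM, history: List[Message],
                   explicit_instruction: str) -> str:
    """The Assistant executes the explicated instruction, never the raw turn."""
    return assistant(history + [{"role": "user", "content": explicit_instruction}])
```

Because the Assistant only ever receives the explicated instruction, user-specific adaptation is confined to the Mediator's context and requires no weight updates.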

Constrained by the limited scale of existing benchmarks, our current Refiner operates in a few-shot, non-parametric manner, extracting explicit, heuristic-level guidelines. While this design ensures efficiency, it captures only coarse-grained interaction patterns. Future work could leverage larger-scale datasets to transition towards parameterized training, enabling the Mediator to internalize more nuanced alignment strategies via fine-tuning rather than relying solely on in-context summaries. Furthermore, the multi-turn settings in current benchmarks exhibit relatively homogeneous user interaction patterns. Developing more comprehensive benchmarks that mirror the complexity of real user behavior remains a critical direction for further validating and evolving our framework.
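For concreteness, the sketch below shows one way such a few-shot, non-parametric Refiner can be realized; the `refine_guidelines` helper and its prompt wording are illustrative assumptions rather than the exact prompts used by our Refiner.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # plain prompt in, completion out (kept simple here)


def refine_guidelines(refiner: LLM,
                      failed: List[str],
                      successful: List[str],
                      existing: List[str]) -> List[str]:
    """Distill explicit, heuristic-level guidelines about a user's pragmatics
    by contrasting failed and successful conversation trajectories."""
    prompt = (
        "Below are multi-turn conversations with the same user.\n\n"
        "FAILED trajectories:\n" + "\n---\n".join(failed) +
        "\n\nSUCCESSFUL trajectories:\n" + "\n---\n".join(successful) +
        "\n\nExisting guidelines:\n- " + "\n- ".join(existing or ["(none)"]) +
        "\n\nCompare the failures with the successes and write up to five short "
        "guidelines describing how this user's messages should be interpreted "
        "(for example, when a fragmentary reply is a correction rather than a "
        "confirmation). Return one guideline per line."
    )
    lines = refiner(prompt).splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]
```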