Conversational AI Systems · Psychology and Social Cognition · Language Understanding and Pragmatics

Why do language models lose performance in longer conversations?

Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.

Note · 2026-02-22 · sourced from Conversation Topics Dialog
Why do AI conversations reliably break down after multiple turns? What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The Intent Mismatch paper offers a fundamentally different explanation for why LLMs get lost in multi-turn conversation. Where Laban et al. attribute the ~30% degradation to model unreliability, this paper argues the root cause is pragmatic mismatch between user expression and model interpretation — an intent alignment gap, not a capability deficit.

Two reframing moves are critical:

First, making premature assumptions is not erroneous behavior but a rational strategy induced by RLHF training. The dominant training objective rewards helpfulness and penalizes evasive responses. Under incomplete information, the model constructs a plausible task formulation for a "typical" user and produces a provisional answer — because that is what the training signal demands. The model is doing exactly what we trained it to do; the problem is that we trained it for the wrong thing in multi-turn contexts.
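
A toy expected-reward comparison makes the rational-strategy point concrete. Everything below is an illustrative assumption (the reward values, the single-turn judge framing), not numbers or code from the paper:

```python
# Illustrative only: a single-turn judge that rewards an immediately helpful answer
# and scores a clarifying question as mildly unhelpful. All values are made up.

def expected_reward_of_guessing(p_guess_correct: float,
                                r_correct: float = 1.0,
                                r_wrong: float = -0.2) -> float:
    """Expected reward for answering now under a guessed 'typical user' intent."""
    return p_guess_correct * r_correct + (1 - p_guess_correct) * r_wrong

R_CLARIFY = 0.1  # assumed reward the judge gives for asking instead of answering

# Even at 30% confidence in the guessed intent, guessing beats clarifying:
print(expected_reward_of_guessing(0.3), ">", R_CLARIFY)  # 0.16 > 0.1
```

Under a signal shaped like this, committing to a provisional answer is the reward-optimal policy, which is the behavior the paper describes.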

Second, the bottleneck is not model capacity or reasoning depth but pragmatic mismatch. Users exhibit systematic individual variation — the same utterance may map to disparate underlying intentions. General-purpose LLMs, aligned to the "average" user, cannot adapt to idiosyncratic behaviors. Models frequently misinterpret fragmentary continuations as confirmations rather than corrections, reinforcing incorrect context.

The proposed fix is architectural: a Mediator-Assistant pipeline that decouples intent understanding from task execution. The Mediator explicates user inputs — articulating latent requirements before they reach the execution Assistant. An LLM-based Refiner distills explicit guidelines from discrepancies between failed and successful interaction trajectories. This enables adaptation to individual user behaviors without weight updates.
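
A minimal sketch of what that decoupling could look like, assuming a generic chat-completion client; the names (call_llm, Mediator, Assistant, refine), prompts, and data shapes are illustrative and not the paper's actual interface:

```python
from dataclasses import dataclass, field

def call_llm(system: str, messages: list[dict]) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError

@dataclass
class Mediator:
    # Guidelines distilled by the Refiner from failed vs. successful trajectories.
    guidelines: list[str] = field(default_factory=list)

    def explicate(self, history: list[dict], user_turn: str) -> str:
        """Rewrite a possibly fragmentary user turn into an explicit task statement,
        flagging whether it confirms, corrects, or extends the prior context."""
        system = (
            "Restate the user's latest message as a fully explicit request. "
            "State whether it CONFIRMS, CORRECTS, or EXTENDS prior context, "
            "and list any still-unspecified requirements.\n"
            "User-specific guidelines:\n" + "\n".join(self.guidelines)
        )
        return call_llm(system, history + [{"role": "user", "content": user_turn}])

@dataclass
class Assistant:
    def execute(self, explicit_task: str) -> str:
        """Solve the task only after intent has been made explicit."""
        return call_llm("Complete the following fully specified task.",
                        [{"role": "user", "content": explicit_task}])

def refine(guidelines: list[str], failed: str, successful: str) -> list[str]:
    """Refiner: distill one reusable guideline from the gap between a failed
    and a successful interaction trajectory (no weight updates involved)."""
    prompt = (
        "Compare the failed and successful conversations and state, as a single rule, "
        f"what the mediator should have inferred about this user.\nFAILED:\n{failed}\n"
        f"SUCCESSFUL:\n{successful}"
    )
    return guidelines + [call_llm("You extract reusable interpretation rules.",
                                  [{"role": "user", "content": prompt}])]
```

The structural point is that per-user adaptation lives entirely in the Mediator's guideline list, so the execution Assistant never needs retraining.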

The theoretical claim is strong: scaling model size or improving training alone cannot resolve this gap, because it arises from structural ambiguity in conversational context rather than representational limitations. This challenges the implicit assumption that bigger or better models will solve multi-turn problems. The QuestBench finding reinforces it (see Can models identify what information they actually need?): the Mediator's role in explicating latent requirements addresses a capability that models demonstrably lack, since they cannot identify what information is missing even when they can solve the fully-specified version of the problem. The intent alignment gap is thus not just about pragmatic mismatch but about a separable cognitive deficit in information gathering. Furthermore, as Why do reasoning models overthink ill-posed questions? argues, when intent is genuinely underspecified (as it is in most multi-turn conversation), reasoning models compound the problem by overthinking rather than recognizing incompleteness, making the Mediator architecture even more necessary.

As discussed in Why do language models respond passively instead of asking clarifying questions?, CollabLLM's reward-signal fix and this paper's architectural fix represent complementary intervention levels for the same underlying problem.

The multi-turn degradation problem exists on both sides of the interaction. User simulators — the systems that conversational agents train against — exhibit the same goal misalignment: they "struggle to consistently adhere to their user goals throughout conversations," failing to maintain profiles, manage multiple objectives, or complete within conversation limits. When simulators drift, they generate conversations that teach agents wrong behaviors through misleading reward signals. See Why do LLM user simulators fail to track their own goals?. This is the evaluation-side manifestation: agent degradation and evaluation degradation compound each other.


Source: Conversation Topics Dialog

Related concepts in this collection

Concept map
21 direct connections · 153 in 2-hop network · medium cluster


Original note title

multi-turn performance degradation is an intent alignment gap not an intrinsic capability deficit — decoupling intent understanding from task execution recovers lost performance