Why do language models fail in gradually revealed conversations?
Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
Laban et al. (2025) conduct large-scale simulation experiments (200,000+ conversations) comparing LLM performance in single-turn fully-specified vs. multi-turn underspecified settings across six generation tasks. The finding is stark: all top open- and closed-weight LLMs exhibit significantly lower performance in multi-turn conversations, with an average drop of 39%.
The performance degradation decomposes into two components. The minor one is aptitude loss — models are slightly less capable when instructions arrive incrementally. The major one is unreliability increase — when models take a wrong turn, they get lost and do not recover. This is the "lost in conversation" phenomenon.
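To make the decomposition concrete: over repeated simulations of the same instruction, best-case performance reflects aptitude and the spread between best and worst runs reflects unreliability. The sketch below is one plausible way to operationalize that idea in the spirit of the paper's analysis; the exact percentile cutoffs and the helper function are illustrative, not the paper's code.

```python
import statistics

def aptitude_and_unreliability(scores):
    """Estimate aptitude and unreliability from repeated simulations of one instruction.

    scores: per-run scores on a 0-100 scale.
    Aptitude ~ best-case performance (90th percentile).
    Unreliability ~ spread between best and worst cases (90th - 10th percentile).
    """
    deciles = statistics.quantiles(scores, n=10)  # 9 cut points: 10th ... 90th percentile
    p10, p90 = deciles[0], deciles[-1]
    return p90, p90 - p10

# Illustrative scores: the model sometimes solves the task, sometimes gets lost.
single_turn = [92, 95, 90, 88, 94, 91, 93, 89]
multi_turn  = [95, 90, 40, 35, 85, 30, 92, 38]
print(aptitude_and_unreliability(single_turn))  # high aptitude, low unreliability
print(aptitude_and_unreliability(multi_turn))   # similar aptitude, far higher unreliability
```

In the paper's terms, aptitude changes little between settings while unreliability rises sharply: the multi-turn model is not much less capable at its best, it is far less dependable.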
Four specific failure behaviors drive the degradation:
- Overly verbose responses — models generate too much too early
- Premature solution proposals — attempting final answers before sufficient information arrives
- Incorrect assumptions — filling in underspecified details with guesses
- Over-reliance on previous attempts — locking in to early (wrong) answers
The SHARDED simulation methodology is key: it splits an existing fully-specified, single-turn instruction into shards and reveals them one per turn, enforcing gradual disclosure. The CONCAT control (the same shards delivered together in a single turn) confirms the effect comes from underspecification and the multi-turn format, not from rephrasing the instruction. The drop appears even in two-turn conversations and across all LLMs tested, from 8B models to state-of-the-art systems.
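For illustration only (not the authors' released code), the sketch below shows how a sharded instruction could be replayed under the two settings; `chat(messages)` stands in for any chat-completion call, and the shard texts are made up.

```python
# Minimal sketch of the two evaluation settings, assuming a generic `chat(messages)`
# function that returns the assistant's reply as a string.

def run_concat(chat, shards):
    """CONCAT control: every shard delivered up front in one fully-specified turn."""
    messages = [{"role": "user", "content": " ".join(shards)}]
    return chat(messages)

def run_sharded(chat, shards):
    """SHARDED setting: shards revealed one per turn; the model answers after each."""
    messages, reply = [], None
    for shard in shards:
        messages.append({"role": "user", "content": shard})
        reply = chat(messages)            # the model may commit to an answer too early here
        messages.append({"role": "assistant", "content": reply})
    return reply                          # final attempt once everything is revealed

# Illustrative shards of a single underlying instruction.
shards = [
    "Write a Python function that removes duplicates from a list.",
    "It must preserve the original order of the elements.",
    "Return the result as a tuple rather than a list.",
]
```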
Agent-like mitigations recover only 15-20% of the loss: RECAP appends a recapitulation of all previously revealed information before the final answer, and SNOWBALL restates everything revealed so far at every turn. The authors argue LLMs should natively support multi-turn interaction — relying on agent frameworks to preprocess the conversation is insufficient. Since Why can't conversational AI agents take the initiative?, this passivity compounds: models neither lead the conversation to gather missing information nor recover when their assumptions prove wrong.
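A rough sketch of the two mitigations, continuing the hypothetical setup above (these mirror the described behavior, not the paper's implementation):

```python
def run_recap(chat, shards):
    """RECAP: run the sharded conversation, then add one final turn restating every shard."""
    messages = []
    for shard in shards:
        messages.append({"role": "user", "content": shard})
        messages.append({"role": "assistant", "content": chat(messages)})
    recap = "To recap, here is everything requested so far: " + " ".join(shards)
    messages.append({"role": "user", "content": recap})
    return chat(messages)

def run_snowball(chat, shards):
    """SNOWBALL: each user turn restates all previously revealed shards plus the new one."""
    messages, reply = [], None
    for i in range(len(shards)):
        messages.append({"role": "user", "content": " ".join(shards[: i + 1])})
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
    return reply
```

Even with the full instruction reassembled this way, the reported recovery is only 15-20% of the loss, which is why the authors treat the problem as model-level rather than orchestration-level.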
The underspecification tested here is not adversarial — it reflects "the principle of least effort" (Zipf), a natural tendency in human conversation. Users routinely start vague and refine. The models' failure is thus a failure at normal conversation, not edge cases. Since Does preference optimization harm conversational understanding?, the premature assumptions are not random — they are incentivized by RLHF training that rewards confident single-turn answers over grounding acts like clarification. The alignment tax produces models that guess rather than ask, and the lost-in-conversation phenomenon is the multi-turn consequence. More specifically, since Why do language models sound fluent without grounding?, the 77.5% reduction in grounding acts means models skip the clarification and repair mechanisms that would prevent the lock-in to incorrect assumptions. And since Do language models actually build shared understanding in conversation?, the premature assumptions are a specific form of this: filling in underspecified details with guesses is precisely presuming common ground that does not yet exist.
The STORM framework reframes this from a model failure to a fundamental interaction design problem. Since How do users actually form intent when prompting AI systems?, underspecification is not laziness — it reflects that users genuinely cannot articulate their full intent upfront. The "gulf of envisioning" means users lack the vocabulary and conceptual framework to specify what they want, while the AI lacks the ability to help them develop it. This deepens the lost-in-conversation diagnosis: models don't just fail at underspecified inputs — they fail at the process through which intent matures from vague to specific.
MultiChallenge (2025) identifies four multi-turn challenge categories at which all frontier models fail: despite near-perfect scores on existing multi-turn benchmarks, none reaches 50% accuracy on MultiChallenge (Claude 3.5 Sonnet scores 41.4%). The four categories: (1) instruction retention — following instructions given in the first turn throughout the entire conversation; (2) inference memory of user information — recalling and connecting details scattered across previous turns; (3) reliable versioned editing — helping users revise materials through back-and-forth iterations; (4) self-coherence — staying consistent with the model's own earlier responses and avoiding sycophancy. Each category requires simultaneous instruction-following, context allocation, and in-context reasoning, confirming that multi-turn failure is a compound capability gap, not a single missing skill. Source: Arxiv/Evaluations.
Source: Conversation Topics Dialog, Conversation Architecture Structure
Related concepts in this collection
- Why can't conversational AI agents take the initiative?
  Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.
  passivity prevents recovery; models can't redirect when lost
- Why do language models respond passively instead of asking clarifying questions?
  Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue.
  next-turn rewards are the training cause of premature solution proposals
- Can models learn to ask clarifying questions instead of guessing?
  Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
  proactive questioning is exactly the missing capability
- Do prior errors in context history amplify future errors?
  When a language model makes mistakes early in a task, do those errors contaminate subsequent predictions? We explore whether error accumulation degrades long-horizon performance through passive context pollution rather than capability limits.
  the lock-in mechanism: prior errors in context amplify future error rates
- How do users actually form intent when prompting AI systems?
  Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
  underspecification reflects genuine inability to articulate intent, not user laziness
- Why do users drift away from their original information need?
  When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
  ASK is the user-side cause of the underspecification that triggers premature assumptions: users in an anomalous knowledge state produce the vague queries that models cannot handle
- Does preference optimization harm conversational understanding?
  Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
  RLHF incentivizes premature assumptions by rewarding confident answers over clarification; the training cause of the lost-in-conversation phenomenon
- Why do language models sound fluent without grounding?
  Explores whether LLM fluency masks the absence of communicative work—the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?
  the 77.5% grounding act reduction means models skip the communicative work that would prevent lock-in to incorrect assumptions
- Do language models actually build shared understanding in conversation?
  When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
  premature assumptions under underspecification are a specific form of presuming common ground that does not yet exist
- Can language models track how minds change during persuasion?
  Do LLMs understand evolving mental states in persuasive dialogue, or do they only capture fixed attitudes? This explores whether models can update their reasoning as a person's beliefs shift across conversation turns.
  the static/dynamic ToM gap is a cognitive mechanism for getting lost: models can snapshot initial user state but cannot track how it evolves across turns, causing assumptions to diverge from the user's actual shifting needs
- Can cumulative rewards teach LLMs multi-step decision making?
  Explores whether attributing full episode rewards to each step enables large language models to solve sequential tasks effectively. This matters because current RL methods fail at multi-turn reasoning despite strong single-turn performance.
  training-level fix: MS-GRPO's cumulative episode reward teaches models that early-turn decisions have downstream consequences, directly addressing the premature-commitment failure where models lock in to assumptions they cannot revise
- Does including all conversation history actually help retrieval?
  Conversational search systems typically use all previous context to understand current queries. But do topic switches in multi-turn conversations inject noise that degrades performance rather than helps it?
  the retrieval-side fix for the lost-in-conversation problem: selective history prevents topic-switch contamination from making the current query context incoherent; the model gets lost partly because irrelevant prior turns warp the effective context
- Can models identify what information they actually need?
  When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
  the diagnostic root: models that solve fully-specified problems reach only 40-50% on clarification tasks and cannot identify what's missing when instructions arrive gradually; the information-gathering deficit precedes and causes the premature assumptions
- Why do reasoning models overthink ill-posed questions?
  Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
  the behavioral mechanism: when underspecification creates ill-posed situations, reasoning models overthink rather than recognizing incompleteness — producing the verbose, non-recovering responses that characterize being "lost"
- Why do AI agents misalign with what users actually want?
  UserBench explores how often AI models fully understand user intent across multi-turn interactions. The study reveals that human communication is underspecified, incremental, and indirect — traits that challenge current models to actively clarify goals.
  UserBench quantifies the downstream cost of premature assumptions: the 20% full-alignment rate reflects models that guess rather than elicit, and the <30% preference discovery rate confirms models cannot recover from initial misunderstandings
Original note title: llms get lost in multi-turn conversation because they make premature assumptions under underspecification and cannot recover