Conversational AI Systems · Language Understanding and Pragmatics · Psychology and Social Cognition

Why do language models fail in gradually revealed conversations?

Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.

Note · 2026-02-22 · sourced from Conversation Topics Dialog
Why do AI conversations reliably break down after multiple turns? · What kind of thing is an LLM really? · How should researchers navigate LLM reasoning research?

Laban et al. (2025) conduct large-scale simulation experiments (200,000+ conversations) comparing LLM performance in single-turn fully-specified vs. multi-turn underspecified settings across six generation tasks. The finding is stark: all top open- and closed-weight LLMs exhibit significantly lower performance in multi-turn conversations, with an average drop of 39%.

The performance degradation decomposes into two components. The minor one is aptitude loss — models are slightly less capable when instructions arrive incrementally. The major one is unreliability increase — when models take a wrong turn, they get lost and do not recover. This is the "lost in conversation" phenomenon.
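A minimal sketch of one way to operationalize that decomposition, assuming aptitude is read off the best-case (high-percentile) score across repeated runs of the same instruction and unreliability off the gap between best- and worst-case runs; the percentile cutoffs and toy numbers are illustrative, not values from the paper:

```python
import numpy as np

def decompose(scores_per_instruction, hi=90, lo=10):
    """Split repeated-simulation scores into aptitude and unreliability.

    scores_per_instruction: one array of scores per instruction, each holding
    the scores from N repeated simulations of that instruction.
    Aptitude = best-case (hi-th percentile) score; unreliability = the gap
    between best- and worst-case percentiles, averaged over instructions.
    """
    aptitude = float(np.mean([np.percentile(s, hi) for s in scores_per_instruction]))
    unreliability = float(np.mean(
        [np.percentile(s, hi) - np.percentile(s, lo) for s in scores_per_instruction]
    ))
    return aptitude, unreliability

# Toy numbers: the multi-turn setting keeps most of the best-case score
# (small aptitude loss) but the spread between good and bad runs explodes
# (large unreliability increase).
single_turn = [np.array([92, 88, 90, 91, 89])]
multi_turn = [np.array([90, 35, 88, 40, 86])]
print(decompose(single_turn), decompose(multi_turn))
```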

Four specific failure behaviors drive the degradation:

  1. Overly verbose responses — models generate too much too early
  2. Premature solution proposals — attempting final answers before sufficient information arrives
  3. Incorrect assumptions — filling in underspecified details with guesses
  4. Over-reliance on previous attempts — locking in to early (wrong) answers

The SHARDED simulation methodology is key: it splits an existing single-turn instruction into shards and reveals one shard per turn, enforcing gradual disclosure. The CONCAT control, which delivers all shards in a single turn, confirms that the effect comes from underspecification and the multi-turn format itself, not from the rephrasing introduced by sharding. The drop appears even in two-turn conversations and across all LLMs from 8B to state-of-the-art.
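A minimal sketch of the two settings, assuming a simple chat-message interface and omitting the simulated user, shard extraction, and scoring that the real harness handles; function and type names are illustrative:

```python
from typing import Callable, List

# A "conversation" is a list of chat messages; the assistant is any callable
# that maps the history so far to its next reply.
Assistant = Callable[[List[dict]], str]

def run_sharded(shards: List[str], assistant: Assistant) -> List[dict]:
    """SHARDED setting: reveal one shard per user turn, forcing gradual disclosure."""
    history: List[dict] = []
    for shard in shards:
        history.append({"role": "user", "content": shard})
        history.append({"role": "assistant", "content": assistant(history)})
    return history

def run_concat(shards: List[str], assistant: Assistant) -> List[dict]:
    """CONCAT control: the same information, delivered as one fully specified turn."""
    prompt = shards[0] + "\n" + "\n".join(f"- {s}" for s in shards[1:])
    history = [{"role": "user", "content": prompt}]
    history.append({"role": "assistant", "content": assistant(history)})
    return history
```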

Agent-like mitigations (RECAP: final-turn recapitulation; SNOWBALL: turn-level reminders) recover only 15-20% of the loss. The authors argue LLMs should natively support multi-turn interaction — relying on agent frameworks to preprocess the conversation is insufficient. Since Why can't conversational AI agents take the initiative?, this passivity compounds: models neither lead the conversation to gather missing information nor recover when their assumptions prove wrong.
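A sketch of how those two mitigations restructure the dialogue, under the same assumptions as the sketch above; the prompt wording and names are illustrative, not the paper's:

```python
from typing import Callable, List

Assistant = Callable[[List[dict]], str]

def with_recap(shards: List[str], history: List[dict], assistant: Assistant) -> List[dict]:
    """RECAP: after the sharded run, add one final user turn that restates every
    revealed shard and gives the model a last chance at a complete answer."""
    recap = "To recap, everything requested so far:\n" + "\n".join(f"- {s}" for s in shards)
    history = history + [{"role": "user", "content": recap}]
    return history + [{"role": "assistant", "content": assistant(history)}]

def snowball_turns(shards: List[str]) -> List[str]:
    """SNOWBALL: each user turn restates all shards revealed so far plus the new
    one, so no detail ever arrives stripped of its earlier context."""
    return ["\n".join(shards[: i + 1]) for i in range(len(shards))]
```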

The underspecification tested here is not adversarial — it reflects "the principle of least effort" (Zipf), a natural tendency in human conversation. Users routinely start vague and refine. The models' failure is thus a failure at normal conversation, not edge cases. Since Does preference optimization harm conversational understanding?, the premature assumptions are not random — they are incentivized by RLHF training that rewards confident single-turn answers over grounding acts like clarification. The alignment tax produces models that guess rather than ask, and the lost-in-conversation phenomenon is the multi-turn consequence. More specifically, since Why do language models sound fluent without grounding?, the 77.5% reduction in grounding acts means models skip the clarification and repair mechanisms that would prevent the lock-in to incorrect assumptions. And since Do language models actually build shared understanding in conversation?, the premature assumptions are a specific form of this: filling in underspecified details with guesses is precisely presuming common ground that does not yet exist.

The STORM framework reframes this from a model failure to a fundamental interaction design problem. Since How do users actually form intent when prompting AI systems?, underspecification is not laziness — it reflects that users genuinely cannot articulate their full intent upfront. The "gulf of envisioning" means users lack the vocabulary and conceptual framework to specify what they want, while the AI lacks the ability to help them develop it. This deepens the lost-in-conversation diagnosis: models don't just fail at underspecified inputs — they fail at the process through which intent matures from vague to specific.

MultiChallenge (2025) identifies four specific multi-turn challenge categories that all frontier models fail. Despite near-perfect scores on existing multi-turn benchmarks, all frontier models achieve less than 50% accuracy on MultiChallenge (Claude 3.5 Sonnet at 41.4%). The four categories: (1) instruction retention — following instructions from the first turn throughout the entire conversation; (2) inference memory of user information — recalling and connecting details scattered across previous turns; (3) reliable versioned editing — helping users revise materials through back-and-forth iterations; (4) self-coherence — maintaining consistency with model responses in conversation history and avoiding sycophancy. Each category requires simultaneous instruction-following, context allocation, and in-context reasoning, confirming that multi-turn failure is a compound capability gap, not a single missing skill. Source: Arxiv/Evaluations.


Source: Conversation Topics Dialog, Conversation Architecture Structure

Original note title

llms get lost in multi-turn conversation because they make premature assumptions under underspecification and cannot recover