Does full conversation history improve or degrade multi-turn retrieval accuracy?
This explores whether dumping the entire conversation into a retrieval system helps it find the right thing, or whether more context actually makes accuracy worse.
This reads the question as: when a system retrieves against a long conversation, is it better to feed it everything that has been said, or to be selective? The corpus answers fairly decisively: full history tends to *degrade* accuracy, and the wins come from choosing what to keep, not keeping it all. The most direct evidence is that automatically selecting the relevant prior turns beats throwing in the whole transcript (Does including all conversation history actually help retrieval?). The reason is intuitive once named: conversations switch topics, and every off-topic turn you carry forward is noise injected into the retrieval query. Automatic selection even beats human annotation of relevant turns when the selecting and the retrieving are optimized together.
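To make the turn-selection idea concrete, here is a minimal sketch. It assumes a plain Jaccard word-overlap score as a stand-in for the learned, jointly optimized selector the cited note describes; the names `tokenize`, `select_turns`, and `build_query`, the stopword list, and the `0.15` threshold are all hypothetical choices for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption, not from the cited work).
STOPWORDS = {"the", "a", "an", "is", "it", "of", "to", "in", "and",
             "for", "how", "do", "does", "what"}

def tokenize(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}

def select_turns(history, current, threshold=0.15):
    """Keep prior turns whose Jaccard overlap with the current turn
    meets the threshold; off-topic turns are dropped."""
    cur = tokenize(current)
    kept = []
    for turn in history:
        toks = tokenize(turn)
        union = cur | toks
        score = len(cur & toks) / len(union) if union else 0.0
        if score >= threshold:
            kept.append(turn)
    return kept

def build_query(history, current, threshold=0.15):
    """Retrieval query = selected prior turns followed by the current turn."""
    return " ".join(select_turns(history, current, threshold) + [current])

history = [
    "Which GPU is best for training transformers?",
    "Any good pasta recipes for dinner?",
    "How much memory does that GPU need?",
]
current = "Is that GPU memory enough for fine-tuning transformers?"
print(build_query(history, current))
```

In this toy run the pasta turn shares no content words with the current question, so it never reaches the retrieval query, while both GPU turns are carried forward. A real selector would use learned relevance rather than word overlap.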
The same shape, more memory making things worse, shows up from a completely different angle. When a single model continuously compresses and reprocesses conversation memory, performance follows an inverted U: helpful up to a point, then it drops *below* having no memory at all, because reprocessing misgroups facts, loses context, and overfits (Can a single model replace retrieval for long-term conversation memory?). So whether you accumulate raw history or aggressively re-summarize it, unbounded context is a liability. There is a budget, and crossing it hurts.
Laterally, the corpus suggests the fix isn't 'less' so much as 'better-shaped.' Abstract preference summaries beat replaying specific past interactions for personalization, and, notably for this question, recency-based recall beats similarity-based retrieval (Does abstract preference knowledge outperform specific interaction recall?). That's the same lesson as turn selection: a compact, well-chosen signal outperforms a faithful-but-bloated record. There's even an agent-side version of this: capping reasoning *per turn* preserves the context window for later retrieval rounds, whereas unrestricted reasoning erodes it (Does limiting reasoning per turn improve multi-turn search quality?).
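The per-turn budget point is mostly arithmetic, and a toy simulation shows it. This is a sketch under stated assumptions: a fixed context limit, a fixed evidence cost per retrieval round, and a hypothetical function `rounds_completed`; the token figures are invented for illustration and are not measurements from the cited note.

```python
def rounds_completed(context_limit, reasoning_demand, evidence_per_round,
                     per_turn_cap=None):
    """Count how many search rounds fit in the context window.

    reasoning_demand: tokens the agent would spend reasoning in each round
    per_turn_cap: optional ceiling on reasoning tokens per round
    """
    used = 0
    completed = 0
    for demand in reasoning_demand:
        reasoning = demand if per_turn_cap is None else min(demand, per_turn_cap)
        cost = reasoning + evidence_per_round
        if used + cost > context_limit:
            break  # no room left to take in new evidence
        used += cost
        completed += 1
    return completed

demand = [3000, 3000, 3000, 3000]  # hypothetical reasoning appetite per round
print(rounds_completed(8000, demand, 1000))                     # uncapped
print(rounds_completed(8000, demand, 1000, per_turn_cap=1000))  # capped
```

With these made-up numbers the uncapped agent fits two rounds before the window is exhausted, while the capped agent fits all four. The cap trades depth of reasoning in each round for the ability to keep incorporating new evidence.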
The twist worth taking away: not all multi-turn failure is a retrieval problem. Some of what looks like 'lost history' is actually the model never having grounded the user's intent in the first place. RLHF rewards confident single-turn answers over clarifying questions, so models silently drift across turns regardless of how much history you supply (Why do language models lose performance in longer conversations?; Does preference optimization harm conversational understanding?). And on the recommender side, the lesson flips once more: systems that use *only* the current session leave proven signal on the table, and the fix is integrating history conditioned on current intent, not raw but filtered through what the user wants now (Can conversational recommenders recover lost preference signals from history?).
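One way to picture "history conditioned on current intent" is as a weighting step. The sketch below is an assumption-heavy illustration, not the cited system's method: the function `intent_conditioned_profile`, the tag sets, and the `history_weight` value are hypothetical. Current-session items get full weight, and historical items count only in proportion to how much they overlap with the current intent.

```python
def intent_conditioned_profile(session_items, history_items, intent_tags,
                               history_weight=0.5):
    """Build an item -> weight profile.

    session_items: items mentioned in the active session (full weight)
    history_items: dict of past item -> set of descriptive tags
    intent_tags: tags describing what the user wants right now
    """
    profile = {item: 1.0 for item in session_items}
    for item, tags in history_items.items():
        # Fraction of the current intent that this past item matches.
        overlap = len(tags & intent_tags) / len(intent_tags) if intent_tags else 0.0
        if overlap > 0:
            weight = history_weight * overlap
            profile[item] = max(profile.get(item, 0.0), weight)
    return profile

session = ["Alien"]
history = {
    "Blade Runner": {"sci-fi", "noir"},
    "Notting Hill": {"romance", "comedy"},
}
intent = {"sci-fi", "horror"}
print(intent_conditioned_profile(session, history, intent))
```

Here the past sci-fi title is carried into the profile at reduced weight, and the romance title is left out entirely because it matches nothing in the current intent. That is the filtered, not raw, use of history the paragraph describes.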
So the synthesis: full history is rarely the answer. The recurring move across selection, compression, personalization, and recommendation is to carry forward a *curated* representation — relevant turns, abstract preferences, recent signal, intent-conditioned history — and the cost of skipping that curation is measurable degradation, sometimes to below-baseline.
Sources (7 notes)
Research shows that automatically selecting relevant previous turns improves retrieval effectiveness more than including all context. Topic switches inject irrelevant information; joint optimization of selection and retrieval beats both full-context baselines and human annotation.
COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below the no-memory baseline due to misgrouping, context loss, and overfitting.
PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating a pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
Current conversational recommender systems (CRS) use only the active dialogue session to infer preferences, losing item-CF and user-CF signals proven valuable in traditional recommenders. Integrating current session, historical dialogues, and look-alike users, all conditioned on current intent, recovers essential user representation structure.