How does treating conversation as a resource change what models learn to do?
This explores what shifts when training treats the conversation itself — not just the next reply — as a source of information and value the model can act on.
The corpus suggests that the default training setup quietly teaches models the opposite: to treat a conversation as a sequence of isolated prompts to answer. Standard RLHF rewards immediate helpfulness, so models learn to answer fast and confidently instead of asking what you actually meant. The result is an 'alignment tax' that cuts grounding acts, such as understanding checks and clarification requests, to 77.5% below human levels (Does preference optimization harm conversational understanding?). The visible symptom is that models get worse over a long conversation, but the research reframes this as intent misalignment baked in by next-turn reward rather than lost capability (Why do language models lose performance in longer conversations?).
The pivot happens when the reward stops looking only at the next turn. CollabLLM estimates the long-term value of an interaction, which makes asking a clarifying question the smart move instead of a penalty: the model learns to discover your intent rather than guess at it (Why do language models respond passively instead of asking clarifying questions?). Even more striking, you may not need to reward conversation directly. Social meta-learning (SML) trains models on fully specified problems, and the ability to treat conversation as an information source emerges on its own, so the model starts asking for missing pieces instead of answering prematurely (Can models learn to ask clarifying questions without explicit training?). Once conversation is a resource, proactivity becomes learnable too: in simulations, volunteering relevant information before being asked can cut dialogue turns by up to 60% in medium-complexity domains (Could proactive dialogue make conversations dramatically more efficient?).
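To make the pivot concrete, here is a minimal toy sketch of why the scoring horizon changes which action looks best. It is not CollabLLM's actual reward estimator. The names `ToyTask`, `next_turn_reward`, `multi_turn_value`, and `best_action`, the uniform-intent assumption, and the `turn_cost` penalty are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToyTask:
    """Hypothetical task: the user means one of n equally likely things."""
    n_intents: int    # number of plausible readings of the request
    turn_cost: float  # assumed penalty for making the user take an extra turn


def next_turn_reward(action: str, task: ToyTask) -> float:
    """Score only the immediate reply, as a next-turn reward would."""
    if action == "answer":
        # A guess matches the user's intent with probability 1/n.
        return 1.0 / task.n_intents
    # A clarifying question completes nothing on its own turn.
    return 0.0


def multi_turn_value(action: str, task: ToyTask) -> float:
    """Score expected success over the whole conversation, minus turn costs."""
    if action == "answer":
        return 1.0 / task.n_intents
    # After asking, the intent is known, so the follow-up answer succeeds,
    # at the price of one extra turn.
    return 1.0 - task.turn_cost


def best_action(reward_fn: Callable[[str, ToyTask], float], task: ToyTask) -> str:
    """Pick whichever action the given reward function scores higher."""
    return max(("answer", "ask"), key=lambda action: reward_fn(action, task))


task = ToyTask(n_intents=4, turn_cost=0.1)
print(best_action(next_turn_reward, task))   # next-turn scoring
print(best_action(multi_turn_value, task))   # whole-conversation scoring
```

Under these assumptions the next-turn scorer always prefers guessing, while the whole-conversation scorer prefers asking whenever the request is ambiguous enough to outweigh the cost of an extra turn. With `n_intents=1` there is nothing to clarify, and both scorers choose to answer.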
What's quietly interesting is how many specific human conversational skills turn out to be absent simply because nothing in training rewarded them. Models don't mirror a user's word choices (lexical entrainment), a basic rapport-building move, until preference data is built to teach it (Why don't conversational AI systems mirror their users' word choices?). They engage with off-topic distractors because they're trained on what-to-do instructions but never on what-to-ignore instructions, a gap that fine-tuning on just 1,080 synthetic dialogues significantly narrows (Why do language models engage with conversational distractors?). And the smooth maintenance work of conversation, such as repairing references and handing off topics, never develops at all, because it is relational rather than informational and the training signal only rewards predicting information (Why don't language models develop conversation maintenance skills?).
There's a deeper limit lurking underneath all of this. Treating conversation as a resource assumes there's something for the model to carry, but an LLM has no persistent carrier between sessions: each instance is reconstituted from stored text, so 'resumed' and 'new' conversations are structurally identical (Does an LLM have anything that persists between conversations?). That's why so much of this work routes around weights entirely. Some agents store verbal self-reflections in episodic memory and improve across attempts without any parameter update (Can agents learn from failure without updating their weights?). Others fold memory generation into the response itself, though that consolidation can backfire, degrading below a no-memory baseline as context piles up (Can a single model replace retrieval for long-term conversation memory?). Taken together, the corpus reframes a lot of 'model is dumb in long chats' complaints as 'we trained it to treat each turn as a transaction'. It shows that when conversation becomes the resource, the model learns to ask, wait, mirror, stay on topic, and proactively help. Those behaviors were never missing for lack of capacity, only for lack of a reason to learn them.
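The "improve without a parameter update" idea can be sketched as a retry loop in which the only thing that changes between attempts is a list of plain-text reflections. This is a toy illustration in the spirit of Reflexion, not its implementation. `frozen_policy` stands in for a model with fixed weights, and the candidate strings, the reflection wording, and the substring check are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple


def frozen_policy(candidates: List[str], reflections: List[str]) -> str:
    """Stand-in for a fixed-weight model: pick the first candidate that no
    stored reflection warns against."""
    for candidate in candidates:
        if not any(f"'{candidate}'" in note for note in reflections):
            return candidate
    # Every candidate has been warned against; fall back to the last one.
    return candidates[-1]


def run_with_reflection(
    candidates: List[str],
    is_success: Callable[[str], bool],
    max_attempts: int,
) -> Tuple[Optional[int], List[str]]:
    """Retry a task, writing a text reflection after each failure.

    Returns the attempt number that succeeded (or None) and the memory.
    """
    memory: List[str] = []  # episodic memory: plain text, kept uncompressed
    for attempt in range(1, max_attempts + 1):
        action = frozen_policy(candidates, memory)
        # The environment gives an unambiguous binary signal.
        if is_success(action):
            return attempt, memory
        memory.append(f"Attempt {attempt}: '{action}' failed; try something else.")
    return None, memory


# Hypothetical task: only the third candidate works.
attempt, memory = run_with_reflection(
    candidates=["plan_a", "plan_b", "plan_c"],
    is_success=lambda action: action == "plan_c",
    max_attempts=5,
)
print(attempt)
print(memory)
```

The policy function is identical on every attempt. Any improvement comes from the growing `memory` list, which is the sense in which this kind of learning lives in stored text rather than in the weights.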
Sources (11 notes)
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Models trained via SML on complete problems generalize to underspecified tasks by asking for needed information and delaying answers. The training paradigm instills a meta-strategy of using conversation as an information source, addressing the premature-answering failure mode.
Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.
Response generation models fail to adapt vocabulary toward users' lexical choices, a phenomenon central to human rapport and clarity. Post-training via DPO on coreference-identified preferences can teach models in-context convention formation.
Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction, not convey information. Language models don't develop these because training signals reward information prediction, not relational work.
While humans have a continuous biological-phenomenological substrate that preserves interaction effects during dormancy, LLMs have no analogous carrier. The virtual instance is reconstituted from stored text each time, making resumed and new conversations structurally identical.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below no-memory baseline due to misgrouping, context loss, and overfitting.