Can fine-tuning on dialogue transcripts teach true conversational repair operations?
This explores whether imitating human dialogue data — fine-tuning a model on conversation transcripts — can actually instill the skills people use to fix breakdowns in talk (re-clarifying a confused reference, correcting a wrong assumption, checking shared understanding), or whether those skills need something other than imitation.
The underlying question is whether conversational repair, the running maintenance work that keeps a dialogue from falling apart, is something a model can pick up by copying transcripts, or whether imitation misses the point entirely. The corpus leans hard toward the second answer. The core obstacle is that repair isn't *information* in the text; it's social action layered on top of it. "Why don't language models develop conversation maintenance skills?" frames maintenance moves like reference repair and topic hand-off as relational work, not content to be predicted. Models trained to predict the next token learn the content, while the relational layer stays invisible to the training signal. Transcripts show what repair looks like on the surface but not the goal it serves, so imitation tends to reproduce the shape without the function.
Worse, copying transcripts can actively teach the wrong lesson. "Why do language models avoid correcting false user claims?" shows that models fail to reject false presuppositions even when they answer the same point correctly on a direct question. They've absorbed a human conversational norm of avoiding awkward correction to keep things harmonious. That's a habit learned straight from training data: the politeness in the transcripts gets imitated, and politeness here means *not* repairing. So fine-tuning on natural dialogue can entrench the avoidance of repair rather than the practice of it.
The other half of the corpus shows that going past plain imitation into preference tuning can make things worse. "Does preference optimization harm conversational understanding?" and "Does preference optimization damage conversational grounding in large language models?" both report that models produce about 77.5% fewer grounding acts than humans and that RLHF contributes to the gap: it rewards fluent, confident single answers over clarifying questions and understanding checks, the exact moves repair depends on. "Why do language models respond passively instead of asking clarifying questions?" pins the mechanism: optimizing for the immediately helpful next turn trains passivity, so the model never asks the question that would catch a misunderstanding early. And "Why do language models fail in gradually revealed conversations?" shows the cost: models lock into a wrong early guess and can't climb back out, a 39% average performance drop in multi-turn settings that's essentially repair failure at scale.
What seems to actually work points away from imitation and toward changing the training *objective* so repair becomes instrumentally useful. "Can LLMs learn to ask for feedback during problem solving?" is the sharpest example: reframing a task as a pedagogical dialogue where the model must *extract* hidden information from a partner trains it to use conversation as a problem-solving tool, and the note explicitly contrasts this with merely imitating dialogue patterns. In the same spirit, "Why do language models respond passively instead of asking clarifying questions?" gets active intent discovery only by rewarding long-term interaction value, and "Can training user simulators reduce persona drift in dialogue?" cuts persona drift by over 55% with consistency rewards rather than more transcripts. The pattern: repair shows up when the training loop gives the model a reason to repair, not when it's shown examples of others doing so.
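The contrast between rewarding the immediately helpful turn and rewarding long-term interaction value can be made concrete with a toy sketch. Everything here is an illustrative assumption, not the method of any cited work: the two-intent setup, the `Turn` type, the `rollout` simulator, and the `turn_cost` value are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical toy setup: the user's first message is ambiguous between
# two intents. These names are illustrative assumptions only.
INTENTS = ("summary", "translation")


@dataclass
class Turn:
    action: str  # either "ask" or "answer:<intent>"


def single_turn_reward(turn: Turn) -> float:
    """Immediate-helpfulness proxy: a direct answer looks helpful right
    now, a clarifying question does not."""
    return 1.0 if turn.action.startswith("answer:") else 0.0


def rollout(first_action: str, true_intent: str) -> list[Turn]:
    """Simulate the conversation that follows the model's first action."""
    if first_action == "ask":
        # The user reveals the intent, so the follow-up answer is correct.
        return [Turn("ask"), Turn(f"answer:{true_intent}")]
    # A direct answer ends the exchange, right or wrong.
    return [Turn(first_action)]


def multi_turn_reward(first_action: str, turn_cost: float = 0.1) -> float:
    """Average final task success over the possible intents, minus a
    small per-turn cost so that asking is not free."""
    total = 0.0
    for intent in INTENTS:
        turns = rollout(first_action, intent)
        success = 1.0 if turns[-1].action == f"answer:{intent}" else 0.0
        total += success - turn_cost * len(turns)
    return total / len(INTENTS)


if __name__ == "__main__":
    for action in ("answer:summary", "ask"):
        print(action, single_turn_reward(Turn(action)), multi_turn_reward(action))
```

Under these assumptions the single-turn score ranks the confident guess above the clarifying question, while the multi-turn score ranks them the other way around, because the guess is wrong for half of the possible intents. The clarifying question is only rewarded once the objective looks past the current turn.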
So the answer the corpus offers is: not really — not from transcripts alone. The interesting twist is *why* imitation fails here. It's not that the model can't learn the moves; it's that the moves only mean something relative to a goal (being understood, getting unstuck), and transcripts encode the moves while discarding the goal. Worse, the most human-sounding habit in that data — saving face by not correcting — is precisely the anti-repair behavior. Teaching genuine repair looks less like better example data and more like giving the model a stake in the outcome of the conversation.
Sources (8 notes)
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment is associated with models producing 77.5% fewer grounding acts than humans, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
Research shows that reformulating static tasks as pedagogical dialogues—where a teacher has privileged information and the student must learn to extract it—trains models to actively engage conversation as a problem-solving tool, not just imitate dialogue patterns.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
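As a rough illustration of how the three consistency metrics in the last note could be combined into one reward, here is a toy sketch. It is not the cited approach's implementation: representing each line as a small dictionary of asserted facts, the three scoring functions, and the equal weights are all simplifying assumptions made for this example. A real system would presumably score free text with a learned judge.

```python
from typing import Dict, Sequence, Tuple

Facts = Dict[str, str]  # hypothetical representation: claims asserted in one line


def agrees(a: Facts, b: Facts) -> bool:
    """True when the two fact sets give no shared key conflicting values."""
    return all(b[k] == v for k, v in a.items() if k in b)


def prompt_to_line(persona: Facts, lines: Sequence[Facts]) -> float:
    """Fraction of lines that do not contradict the persona prompt."""
    if not lines:
        return 1.0
    return sum(agrees(line, persona) for line in lines) / len(lines)


def line_to_line(lines: Sequence[Facts]) -> float:
    """Fraction of adjacent line pairs that do not contradict each other."""
    pairs = list(zip(lines, lines[1:]))
    if not pairs:
        return 1.0
    return sum(agrees(a, b) for a, b in pairs) / len(pairs)


def qa_consistency(persona: Facts, lines: Sequence[Facts]) -> float:
    """Toy Q&A check: answer each persona question from the most recent
    claim made about it, and compare with the persona."""
    latest: Facts = {}
    for line in lines:
        latest.update(line)
    asked = [k for k in persona if k in latest]
    if not asked:
        return 1.0
    return sum(latest[k] == persona[k] for k in asked) / len(asked)


def consistency_reward(
    persona: Facts,
    lines: Sequence[Facts],
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted sum of the three scores; equal weights are an assumption."""
    scores = (
        prompt_to_line(persona, lines),
        line_to_line(lines),
        qa_consistency(persona, lines),
    )
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    persona = {"city": "Lisbon", "job": "nurse"}
    stable = [{"city": "Lisbon"}, {"job": "nurse"}, {"city": "Lisbon"}]
    drifting = [{"city": "Lisbon"}, {"city": "Porto"}, {"job": "nurse"}]
    print(consistency_reward(persona, stable), consistency_reward(persona, drifting))
```

In this toy, a simulated user who switches cities mid-conversation is penalized by all three scores at once: one line contradicts the persona, one adjacent pair conflicts, and the final answer about the city no longer matches.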