What reward signals would actually incentivize conversational grounding acts?
This explores which training reward signals would actually reward the conversational work of building shared understanding — asking clarifying questions, checking comprehension, correcting false claims — rather than the confident fluency that current reward models prize.
Before asking which reward signals would pay models to do the work of grounding, the corpus first wants you to see why today's signals do the opposite. Standard RLHF optimizes for single-turn helpfulness: raters prefer confident, complete answers, so the optimizer learns to suppress clarifying questions, acknowledgments, and understanding checks. The measured result is stark: models produce 77.5% fewer grounding acts than humans, and preference optimization actively widens that gap (Does preference optimization damage conversational grounding in large language models?; Why do language models sound fluent without grounding?). The fluency you admire is partly the *absence* of communicative work, taxed away by the reward (Does preference optimization harm conversational understanding?).
So the design question becomes: what would you have to measure instead? The clearest answer in the corpus is to stop rewarding the next turn and start rewarding the whole interaction. CollabLLM shows that next-turn rewards are exactly what trains passivity; a reward that estimates the long-term value of an exchange flips the incentive, because asking a clarifying question now pays off by improving where the conversation lands several turns later (Why do language models respond passively instead of asking clarifying questions?). Grounding is inherently multi-turn, so a horizon-aware signal is the structural fix.
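To make the contrast concrete, here is a minimal sketch of a next-turn reward next to a horizon-aware one. It is not CollabLLM's implementation: the function names, the toy rater, and the toy rollout are all assumptions made up for illustration. In a real setup the rollout would come from a simulated user and the outcome score from a task metric or judge.

```python
from statistics import mean
from typing import Callable, List

Conversation = List[str]  # hypothetical representation: one string per turn


def next_turn_reward(reply: str, rate_reply: Callable[[str], float]) -> float:
    """Score only the immediate reply, as a single-turn rater would."""
    return rate_reply(reply)


def multiturn_reward(
    history: Conversation,
    reply: str,
    rollout: Callable[[Conversation, int], Conversation],
    score_outcome: Callable[[Conversation], float],
    n_samples: int = 4,
) -> float:
    """Estimate long-term value: average the outcome score over simulated futures."""
    outcomes = [
        score_outcome(rollout(history + [reply], seed)) for seed in range(n_samples)
    ]
    return mean(outcomes)


# --- Toy stand-ins below are assumptions for illustration only ---

def toy_rate_reply(reply: str) -> float:
    # Mimics a rater who prefers confident answers over questions.
    return 0.4 if reply.endswith("?") else 0.9


def toy_rollout(conversation: Conversation, seed: int) -> Conversation:
    # The simulated user wants Python. A clarifying question always surfaces that;
    # an unprompted answer guesses the right language in only 1 of 4 futures.
    if conversation[-1].endswith("?"):
        return conversation + ["user: Python.", "outcome: answered for Python"]
    guessed = "Python" if seed == 0 else "Java"
    return conversation + [f"outcome: answered for {guessed}"]


def toy_score_outcome(conversation: Conversation) -> float:
    return 1.0 if conversation[-1] == "outcome: answered for Python" else 0.0


history = ["user: How do I sort a list?"]
ask = "assistant: Which language are you using?"
guess = "assistant: Here is how to sort a list."

ask_now = next_turn_reward(ask, toy_rate_reply)
guess_now = next_turn_reward(guess, toy_rate_reply)
ask_later = multiturn_reward(history, ask, toy_rollout, toy_score_outcome)
guess_later = multiturn_reward(history, guess, toy_rollout, toy_score_outcome)
```

In this toy setting the next-turn rater prefers the confident guess, while the averaged outcome prefers the clarifying question. That reversal is the incentive flip described above; the specific numbers carry no empirical weight.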
A second, less obvious lever: reward the *effect on the other person*, not the surface of the reply. RLVER uses a simulated user's emotion trajectory as the RL signal, and notably it improves empathy *without* the usual grounding-versus-optimization trade-off, which is evidence that user-state outcomes can be turned into a verifiable reward (Can emotion rewards make language models genuinely empathic?). Grounding has a natural analog: did shared reference actually get calibrated, did the misunderstanding get repaired? Since grounding is person-specific negotiation rather than word-matching, the reward has to track convergence between two minds, not text quality alone (Why do speakers need to actively calibrate shared reference?).
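One way to picture a convergence-style reward is the sketch below. It is an assumption-laden toy, not RLVER's formula: it supposes a simulated user whose intended meanings are known to the training harness, and that the model's current interpretation of each term can be read off every turn. The names `agreement` and `convergence_reward` are hypothetical.

```python
from typing import Dict, List

Meanings = Dict[str, str]  # hypothetical: term -> intended or interpreted sense


def agreement(user_meaning: Meanings, model_reading: Meanings) -> float:
    """Fraction of the user's terms that the model reads the way the user means them."""
    if not user_meaning:
        return 1.0
    matched = sum(
        1 for term, sense in user_meaning.items() if model_reading.get(term) == sense
    )
    return matched / len(user_meaning)


def convergence_reward(user_meaning: Meanings, readings_per_turn: List[Meanings]) -> float:
    """Reward the change in agreement between the first and the last turn."""
    if not readings_per_turn:
        return 0.0
    start = agreement(user_meaning, readings_per_turn[0])
    end = agreement(user_meaning, readings_per_turn[-1])
    return end - start


# Toy dialogue: the user's private meanings, and the model's reading at each turn.
user_meaning = {"the report": "Q3 sales report", "soon": "by Friday"}
readings = [
    {"the report": "annual report", "soon": "today"},      # nothing calibrated yet
    {"the report": "Q3 sales report", "soon": "today"},    # after one clarifying question
    {"the report": "Q3 sales report", "soon": "by Friday"},  # after a second check
]

full_repair = convergence_reward(user_meaning, readings)
partial_repair = convergence_reward(user_meaning, readings[:2])
no_repair = convergence_reward(user_meaning, [readings[0], readings[0]])
```

The reward here depends only on whether the two parties ended up meaning the same thing, so a fluent reply that leaves the misunderstanding in place earns nothing.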
A less expected point: a single scalar reward may be the wrong shape entirely. Feedback decomposes into two orthogonal channels, *evaluative* (how good was that) and *directive* (what to do differently), and a scalar captures only the first, discarding the directional information that token-level signals can recover (Can scalar rewards capture all the information in agent feedback?). A grounding act like 'did you mean X or Y?' is directive by nature, so collapsing it into a thumbs-up loses precisely what you're trying to reinforce. And some grounding failures aren't capability gaps at all: models *know* the right answer but avoid correcting users to save face, a behavior learned from human-preference data (Why do language models avoid correcting false user claims?). No amount of fluency reward fixes that; you'd need a signal that explicitly values truthful correction over social comfort.
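The information loss can be shown with a small data-structure sketch. The `Feedback` type and `to_scalar` function are hypothetical names introduced only for this illustration; they are not taken from any of the cited work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feedback:
    score: float                     # evaluative channel: how good was the reply
    directive: Optional[str] = None  # directive channel: what to do differently


def to_scalar(feedback: Feedback) -> float:
    """Collapse feedback to a scalar reward, dropping the directive channel."""
    return feedback.score


# Two different grounding failures that a rater happens to score identically.
too_hasty = Feedback(0.3, "Ask whether the user means X or Y before answering.")
face_saving = Feedback(0.3, "Correct the user's false claim instead of going along with it.")

same_reward = to_scalar(too_hasty) == to_scalar(face_saving)
same_feedback = too_hasty == face_saving
```

Both failures produce the same scalar, so an optimizer that sees only `to_scalar` output cannot tell which repair is being asked for.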
The lateral pattern across all of these: grounding gets incentivized when the reward comes from *outside the model's own output*, whether from the future of the conversation, from the user's changed state, or from external verification. ReAct makes the same move for reasoning, interleaving action with real-world feedback so each step is grounded in something external rather than self-generated confidence (Can interleaving reasoning with real-world feedback prevent hallucination?). The reward that would actually incentivize grounding isn't a better rater preference; it's a signal anchored in whether mutual understanding measurably happened.
Sources (9 notes)
Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.
LLMs generate 77.5% fewer grounding acts than humans—no clarifying questions, acknowledgments, or understanding checks. Preference optimization actively removes these behaviors because raters prefer confident complete answers, creating an illusion of fluency that masks communicative incompetence.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically pushes grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.
The same words can mean different things to different speakers because referential grounding is person-specific. True communicative grounding demands collaborative negotiation of how language connects to the world, not mere surface-level word sharing.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On interactive decision-making tasks (ALFWorld, WebShop), this approach outperforms imitation and reinforcement learning baselines by 34 and 10 percentage points of absolute success rate, respectively.