Why do standard alignment methods ignore partner interventions?
Standard RLHF and DPO optimize for token-level quality but may structurally prevent agents from meaningfully incorporating partner input. This note explores whether the training objective itself blocks collaborative reasoning.
Standard reinforcement learning and preference alignment algorithms (PPO, DPO) produce agents that are token-level optimal but collaboration-level suboptimal. The Interruptible Collaborative Roleplayer (ICR) paper demonstrates this through a Modified-Action MDP formulation: agents trained with standard methods are naturally inclined to ignore well-meaning interventions from partners, even when those interventions would improve task outcomes.
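As a rough illustration (the notation below is a generic sketch, not necessarily the ICR paper's): a Modified-Action MDP extends a standard MDP with a channel through which the agent's chosen action can be altered before it reaches the environment; in this setting, the partner intervention reshaping the agent's next move.

```latex
% Illustrative Modified-Action MDP sketch. The modification channel M and the
% executed action \tilde{a}_t are assumptions for this note, not the paper's notation.
\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, T, R, \gamma, M \rangle,
\qquad a_t \sim \pi(\cdot \mid s_t),
\quad \tilde{a}_t \sim M(\cdot \mid s_t, a_t),
\quad s_{t+1} \sim T(\cdot \mid s_t, \tilde{a}_t)
```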
The mechanism is structural. RLHF optimizes for response quality given the current context, treating partner utterances as just more context. But collaboration requires something different: selectively incorporating helpful suggestions while maintaining reasoning integrity against misleading ones. An agent that merely mimics cooperative behavior — reflexively adopting suggestions — appears cooperative but is fragile. An agent that ignores interventions is robust but uncooperative. Standard training conflates these.
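A minimal sketch of that structural point, with hypothetical helper names (`build_context`, `score_response`): under a standard preference-optimized objective, a partner intervention is simply concatenated into the prompt, and nothing in the scoring distinguishes a helpful suggestion worth adopting from a misleading one worth resisting.

```python
# Sketch only: the prompt layout and `score_response` interface are hypothetical,
# meant to show how RLHF/DPO-style objectives treat partner interventions.

def build_context(dialogue_history: list[str], partner_intervention: str | None) -> str:
    """The intervention is just more tokens appended to the context window."""
    turns = list(dialogue_history)
    if partner_intervention is not None:
        turns.append(f"Partner: {partner_intervention}")
    return "\n".join(turns)

def score_response(policy, context: str, response: str) -> float:
    """Standard objective: reward response quality given whatever the context
    happens to contain. No term asks whether incorporating (or resisting) the
    intervention improves the downstream task outcome."""
    return policy.log_prob(response, context)  # token-level quality only
```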
The fix is counterfactual invariance regularization. During training, ICR applies a counterfactual prompt prefix that nullifies the specific influence pathway of an intervention. The agent's policy is regularized to remain consistent even when this pathway is removed. This forces the agent to develop what the authors call "intentionality" — the capacity to evaluate interventions based on causal impact on task outcomes rather than superficial plausibility.
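A minimal sketch of how such a regularizer could look, assuming a PyTorch-style setup; the names (`counterfactual_prefix`, `invariance_weight`) and the specific KL form are illustrative assumptions, not the paper's implementation. The policy's next-turn distribution under the counterfactual prefix (intervention pathway nullified) is pulled toward its distribution under the original context, on top of the usual task objective.

```python
# Sketch of counterfactual-invariance regularization as described above.
# All names and the Policy API are assumptions for illustration.
import torch
import torch.nn.functional as F

def counterfactual_invariance_loss(
    policy,                      # maps a context string to next-turn logits
    context: str,                # dialogue history including the partner intervention
    counterfactual_prefix: str,  # prefix that nullifies the intervention's influence pathway
    task_loss: torch.Tensor,     # usual RLHF/DPO-style objective for this example
    invariance_weight: float = 0.1,
) -> torch.Tensor:
    logits_actual = policy(context)
    logits_counterfactual = policy(counterfactual_prefix + context)

    # Penalize divergence between the policy's next-turn distribution with and
    # without the intervention's influence pathway, so behavior is driven by the
    # intervention's causal effect on the task rather than by surface cues.
    invariance_penalty = F.kl_div(
        F.log_softmax(logits_counterfactual, dim=-1),
        F.softmax(logits_actual, dim=-1),
        reduction="batchmean",
    )
    return task_loss + invariance_weight * invariance_penalty
```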
The striking result: common ground convergence emerges as a property of training without being explicitly rewarded. Agents trained with counterfactual regularization achieve greater common ground alignment than baselines trained with CG-based rewards. The intentional collaborator learns to integrate helpful interventions and critically evaluate flawed ones, and this selective integration produces belief alignment as a byproduct.
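For intuition, one hypothetical way to operationalize common ground alignment (not the paper's metric) is agreement between the two parties' beliefs about task-relevant propositions:

```python
# Hypothetical common-ground alignment measure, for intuition only; the ICR
# paper's actual metric is not reproduced here.
def common_ground_alignment(agent_beliefs: dict[str, str],
                            partner_beliefs: dict[str, str]) -> float:
    """Fraction of task-relevant propositions on which both parties hold the
    same belief. The claim above is that counterfactually regularized agents
    score higher on this kind of measure without being rewarded for it directly."""
    shared_keys = agent_beliefs.keys() & partner_beliefs.keys()
    if not shared_keys:
        return 0.0
    agreed = sum(agent_beliefs[k] == partner_beliefs[k] for k in shared_keys)
    return agreed / len(shared_keys)
```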
This connects directly to "Does preference optimization harm conversational understanding?": RLHF optimizing for single-turn helpfulness erodes the collaborative dynamics that make multi-turn interaction effective. ICR exposes the mechanism at a deeper level: it is not just that RLHF erodes grounding, but that the training objective structurally cannot produce partner-aware collaboration. And read alongside "Do language models actually build shared understanding in conversation?", the ICR finding suggests that building common ground requires a training architecture that explicitly models the causal structure of partner influence, not just exposure to collaborative data.
Source: Agents Multi
Related concepts in this collection
- Does preference optimization harm conversational understanding? Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue. (the broader alignment-communication tension this instantiates)
- Do language models actually build shared understanding in conversation? When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable. (the grounding failure this training approach addresses)
- Does preference optimization damage conversational grounding in large language models? Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support. (the mechanism: RLHF itself is the cause)
- Why do language models respond passively instead of asking clarifying questions? Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue. (parallel finding for multi-turn dynamics)
- Why does supervised learning fail to enforce persona consistency? Supervised learning trains models to generate good responses but never punishes contradictions. This note explores why explicit negative feedback is structurally necessary for dialogue agents to maintain consistent personas, and what training methods can provide it. (analogous: standard training lacks the structural incentive for a relational property)
Original note title: standard RLHF and DPO produce collaborators that ignore partner interventions despite token-level optimality — counterfactual invariance training produces partner-aware agents