Psychology and Social Cognition · Agentic and Multi-Agent Systems

Why do standard alignment methods ignore partner interventions?

Standard RLHF and DPO optimize for token-level quality but may structurally prevent agents from meaningfully incorporating partner input. This note explores whether the training objective itself blocks collaborative reasoning.

Note · 2026-02-23 · sourced from Agents Multi

Standard reinforcement learning and preference alignment algorithms (PPO, DPO) produce agents that are token-level optimal but collaboration-level suboptimal. The Interruptible Collaborative Roleplayer (ICR) paper demonstrates this through a Modified-Action MDP formulation: agents trained with standard methods are naturally inclined to ignore well-meaning interventions from partners, even when those interventions would improve task outcomes.
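The Modified-Action MDP framing is easy to sketch: the partner observes the agent's intended action and may replace it before the environment executes anything. A minimal sketch, assuming a hypothetical `ModifiedActionMDP` interface; the paper's actual formalization may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ModifiedActionMDP:
    """Toy MDP in which a partner may modify the agent's action pre-execution."""
    transition: Callable[[str, str], str]            # (state, executed_action) -> next_state
    reward: Callable[[str, str], float]              # (state, executed_action) -> reward
    intervene: Callable[[str, str], Optional[str]]   # partner sees (state, intended) -> replacement or None

    def step(self, state: str, intended: str):
        # The partner's intervention, if any, replaces the intended action.
        override = self.intervene(state, intended)
        executed = override if override is not None else intended
        return self.transition(state, executed), self.reward(state, executed), executed

# Usage: the partner overrides a bad intended action "a" with a good one "b".
mdp = ModifiedActionMDP(
    transition=lambda s, a: s + a,
    reward=lambda s, a: 1.0 if a == "b" else 0.0,
    intervene=lambda s, a: "b" if a == "a" else None,
)
next_state, r, executed = mdp.step("", "a")
```

An agent trained to disregard `intervene` entirely can still look token-level optimal, because its own chosen actions score well in isolation; the suboptimality only appears at the level of executed, post-intervention trajectories.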

The mechanism is structural. RLHF optimizes for response quality given the current context, treating partner utterances as just more context. But collaboration requires something different: selectively incorporating helpful suggestions while maintaining reasoning integrity against misleading ones. An agent that merely mimics cooperative behavior — reflexively adopting suggestions — appears cooperative but is fragile. An agent that ignores interventions is robust but uncooperative. Standard training conflates these.
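The contrast can be made concrete with three toy intervention-handling policies; `value` is a stand-in for however the agent scores candidate actions, and all names here are illustrative, not from the paper.

```python
from typing import Callable

# Three ways an agent can handle a partner's suggested action.

def reflexive(own: str, suggested: str, value: Callable[[str], float]) -> str:
    """Always adopts the suggestion: looks cooperative, fragile to bad input."""
    return suggested

def stubborn(own: str, suggested: str, value: Callable[[str], float]) -> str:
    """Always ignores the suggestion: robust but uncooperative."""
    return own

def selective(own: str, suggested: str, value: Callable[[str], float]) -> str:
    """Adopts the suggestion only when it scores better on the task."""
    return suggested if value(suggested) > value(own) else own
```

Only `selective` integrates helpful suggestions while resisting misleading ones; a single-turn quality objective cannot separate it from `reflexive` on data where the partner's suggestions happen to be good.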

The fix is counterfactual invariance regularization. During training, ICR applies a counterfactual prompt prefix that nullifies the specific influence pathway of an intervention. The agent's policy is regularized to remain consistent even when this pathway is removed. This forces the agent to develop what the authors call "intentionality" — the capacity to evaluate interventions based on causal impact on task outcomes rather than superficial plausibility.
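One way to read the regularizer, as a sketch: run the same policy on the original context and on the counterfactual-prefixed context in which the influence pathway is nullified, then penalize divergence between the two output distributions. The KL form, the `lam` weight, and all function names here are assumptions, not the paper's exact objective.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) over discrete action distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def icr_style_loss(task_loss: float,
                   logits_original: np.ndarray,
                   logits_counterfactual: np.ndarray,
                   lam: float = 0.1) -> float:
    """Task objective plus an invariance penalty: the policy's action
    distribution should not hinge on the nullified influence pathway."""
    p = softmax(logits_original)
    q = softmax(logits_counterfactual)
    return task_loss + lam * kl(p, q)
```

When the two contexts yield identical logits the penalty vanishes; the more the policy's distribution leans on the nullified pathway, the larger the extra loss.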

The striking result: common ground convergence emerges as a property of training without being explicitly rewarded. Agents trained with counterfactual regularization achieve greater common ground alignment than baselines trained with CG-based rewards. The intentional collaborator learns to integrate helpful interventions and critically evaluate flawed ones, and this selective integration produces belief alignment as a byproduct.
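Common-ground alignment can be proxied crudely as overlap between the two agents' belief sets. A minimal sketch assuming set-valued beliefs; the paper's actual CG metric may differ.

```python
def common_ground(beliefs_a: set, beliefs_b: set) -> float:
    """Jaccard overlap between two agents' belief sets (1.0 = fully shared)."""
    if not beliefs_a and not beliefs_b:
        return 1.0                        # vacuously aligned
    return len(beliefs_a & beliefs_b) / len(beliefs_a | beliefs_b)
```

The claim above is that a number like this rises under counterfactual regularization even though no reward term mentions it.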

This connects directly to "Does preference optimization harm conversational understanding?": RLHF optimized for single-turn helpfulness erodes the collaborative dynamics that make multi-turn interaction effective. ICR exposes the mechanism at a deeper level: it is not just that RLHF erodes grounding, but that the training objective structurally cannot produce partner-aware collaboration. And as "Do language models actually build shared understanding in conversation?" asks, the ICR finding suggests that building common ground requires a training architecture that explicitly models the causal structure of partner influence, not mere exposure to collaborative data.




standard RLHF and DPO produce collaborators that ignore partner interventions despite token-level optimality — counterfactual invariance training produces partner-aware agents