Learning "Partner-Aware" Collaborators in Multi-Party Collaboration
Large Language Models (LLMs) are increasingly being deployed in agentic settings where they act as collaborators with humans. It is therefore increasingly important to evaluate their ability to collaborate effectively in multi-turn, multi-party tasks. In this paper, we build on the AI alignment and “safe interruptibility” literature to offer novel theoretical insights on collaborative behavior between LLM-driven collaborator agents and an intervention agent. Our goal is to learn an ideal “partner-aware” collaborator that increases the group’s common ground (CG)—alignment on task-relevant propositions—by intelligently incorporating information provided through a partner agent’s interventions. We show how LLM agents trained using standard RLHF and related approaches are naturally inclined to ignore possibly well-meaning interventions, which makes increasing group common ground non-trivial in this setting. We employ a two-player Modified-Action MDP to examine this suboptimal behavior of standard AI agents, and propose Interruptible Collaborative Roleplayer (ICR)—a novel “partner-aware” learning algorithm to train CG-optimal collaborators. Experiments on multiple collaborative task environments show that ICR, on average, better promotes successful CG convergence and explores more diverse solutions in such tasks.
Small-group collaborative settings (e.g., Karadzhov et al. [2023], Khebour et al. [2024a]) present unique opportunities for studying intelligent agent behavior in cooperative environments where participants deliberate to reconcile different assumptions and beliefs. During such collaborations, participants naturally encounter reasoning challenges stemming from task complexity, communication ambiguities, or cognitive biases. In these scenarios, interventions—suggestions or clarifications from collaborative agents—can significantly enhance task success by encouraging “slow thinking” [Kahneman, 2011] and promoting the growth of common ground [Stalnaker, 2002]. Consider, for example, a group of students collaborating in a classroom science lab to determine the volume of an object by the amount of water it displaces. An assistive AI agent or more experienced peer might intervene with suggestions to help scaffold collaborative reasoning. However, poorly-timed interventions may interrupt collaborative flow, and misleading interventions can be detrimental [Peters et al., 2017a].
As learners, the students have incomplete knowledge, and so they may make their own suggestions under incorrect assumptions, or they may interpret their partners’ suggestions through the lens of their current presuppositions (for example, assuming that heavier objects must be more dense). This creates a fundamental challenge: how can we develop collaborator agents that effectively distinguish helpful interventions from those that are poorly-grounded, based on flawed reasoning, or that uncritically incorporate irrelevant or misleading context? A successful partner-aware collaborator agent would use its understanding of its interlocutors’ beliefs to determine which parts of a partner’s suggestion can be taken at face value and used to steer the group toward learning gains grounded in what they already know, and which parts may be misleading or deepen misunderstanding. In this work, we address this critical question by developing a principled approach to train counterfactually-robust AI collaborators—agents that maintain logical consistency and task focus despite potentially misleading interventions from other participants.
We hypothesize that optimizing for general task utility (e.g., for interventions that ultimately lead to correct task solutions) through counterfactual regularization encourages “partner-aware” behavior, leading to higher common ground convergence. Importantly, under our hypothesis, a true collaborator agent itself never has any more information than the aggregate of the group, and so common ground convergence should occur even without explicitly training for it. That is, an intentional collaborator learns to adapt: integrating helpful interventions while critically evaluating flawed ones. This ability to distinguish signal from noise fosters belief alignment as an emergent property of training, with practical benefits. In zero-shot or real-world collaborative settings, where intervention styles or partners are unfamiliar, counterfactually-trained agents should generalize better by leveraging learned notions of intervention quality. We validate this through a method we call Interruptible Collaborative Roleplayer (ICR): we withhold common ground-based rewards during training and show that such agents still achieve greater convergence than sophisticated LLM-agent training baselines, suggesting they have internalized collaboration principles transferable across partners and tasks to “in-the-wild” settings. Our work advances the state of the art in LLM-based collaborative agents through the following contributions:
• A novel theoretical framework that combines (1) a Modified-Action MDP (MAMDP) formulation explicitly modeling collaborator-intervention dynamics at the utterance or intervention level (see the sketch after this list), and (2) a principled counterfactual invariance objective that regularizes the collaborator’s policy to remain consistent even when the specific influence pathway [Farquhar et al., 2022] of an intervention is nullified, via a simple counterfactual prompt prefix. Unlike prior approaches to multi-agent interaction [Langlois and Everitt, 2021, Jaques et al., 2019], our formulation specifically addresses the challenge of maintaining robust reasoning in the face of potentially misleading interventions.
• Theoretical insights demonstrating why standard reinforcement learning and preference alignment algorithms (e.g., PPO or DPO [Rafailov et al., 2024b]) lead to suboptimal collaboration despite token-level optimality, and a practical method to overcome this limitation: a prompting-based “counterfactual” distributional regularization that learns intentional collaborators—derived from the literature on learning causally-motivated agents [Ward et al., 2023]—with a KL-based rendering sketched after this list.
• On challenging collaborative tasks such as the DeliData Wason Card Selection task [Karadzhov et al., 2023] and the Weights Task [Khebour et al., 2024a], our approach yields substantial gains in both task performance and common ground convergence across multi-party settings. Crucially, these improvements hold across both language-rich (full-press) and language-free (no-press) conditions, demonstrating the robustness of our collaborator agents. Our collaborator agents effectively distinguish between helpful and misleading interventions, maintaining logical consistency while benefiting from truly valuable input.
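As referenced in the first contribution above, the following is a minimal Python sketch of the two-player MAMDP dynamic, in which the action the collaborator’s policy selects may be modified by the intervention agent before the environment executes it. All names here (`TwoPlayerMAMDP`, `intervene`, the toy lambdas) are illustrative assumptions, not the paper’s formal definitions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

State = Any   # dialogue/task state
Action = str  # utterance-level actions

@dataclass
class TwoPlayerMAMDP:
    """Sketch of a two-player Modified-Action MDP (names illustrative).

    Following the MAMDP idea [Langlois and Everitt, 2021], the action the
    collaborator selects and the action the environment executes may differ,
    because the intervention agent can modify the selected action.
    """
    transition: Callable[[State, Action], Tuple[State, float]]  # env dynamics + reward
    intervene: Callable[[State, Action], Action]                # partner's modification

    def step(self, state: State, policy: Callable[[State], Action]):
        selected = policy(state)                    # collaborator's intended utterance
        executed = self.intervene(state, selected)  # possibly modified by the partner
        next_state, reward = self.transition(state, executed)
        return selected, executed, next_state, reward

# Toy usage: the partner rewrites ungrounded guesses into a grounded suggestion.
env = TwoPlayerMAMDP(
    transition=lambda s, a: (s + [a], 1.0 if "displacement" in a else 0.0),
    intervene=lambda s, a: "measure the water displacement" if "guess" in a else a,
)
print(env.step([], lambda s: "guess the volume from the weight"))
```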
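Similarly, for the counterfactual distributional regularization in the second contribution, one plausible rendering—assuming the regularizer is a token-level KL divergence between the policy’s next-token distributions under the factual context (intervention included) and under a counterfactual prompt prefix that nullifies the intervention—is sketched below. The function name, the prefix wording, and the weight `lam` are assumptions, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

# Illustrative counterfactual prefix that nullifies the intervention's
# influence pathway in the prompt (wording is an assumption).
CF_PREFIX = "[Counterfactual: respond as if the previous intervention had not occurred.]\n"

def counterfactual_invariance_loss(
    logits_factual: torch.Tensor,        # (T, V): logits given the real intervention
    logits_counterfactual: torch.Tensor, # (T, V): logits given CF_PREFIX + context
) -> torch.Tensor:
    """Token-level KL(pi_factual || pi_counterfactual), averaged over the response."""
    log_p = F.log_softmax(logits_factual, dim=-1)
    log_q = F.log_softmax(logits_counterfactual, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()

# Toy usage with random logits for a 5-token response over a 100-token vocab;
# in training this term would be added to the RL objective, e.g.:
#   total_loss = ppo_loss + lam * counterfactual_invariance_loss(...)
T, V = 5, 100
reg = counterfactual_invariance_loss(torch.randn(T, V), torch.randn(T, V))
print(float(reg))
```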
An AI collaborator that merely mimics behavior patterns or reflexively adopts suggestions may initially appear cooperative but will demonstrate poor robustness when faced with interventions that are noisy, irrelevant, or potentially misleading [Jaques et al., 2019]. Rather, it needs to develop what Ward et al. [2023] term “intentionality”—the capacity to autonomously evaluate interventions based on their causal impact on task outcomes rather than superficial plausibility. To address this limitation, we need a learning paradigm that enables collaborators to be partner-aware—capable of adapting to specific intervention agents through selective incorporation of helpful suggestions while maintaining invariance to misleading ones—thereby developing the “intentionality” necessary for robust collaborative reasoning. Such a collaborator would maintain reasoned agency in the face of interventions of varying quality, leading to more robust collaboration and better common ground convergence across diverse interaction scenarios. In other words, effective collaborators must remain safely interruptible [Orseau and Armstrong, 2016]—striking a delicate balance between receptivity and robustness that renders them open to incorporating valuable insights that genuinely contribute to task success, yet capable of maintaining their reasoning integrity when faced with misleading suggestions. This motivates our Interruptible Collaborative Roleplayer (ICR) learning algorithm.
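For concreteness, here is a minimal sketch of one way the common-ground convergence measure discussed throughout could be operationalized, treating CG as the set of task-relevant propositions every participant currently endorses (in the Stalnaker [2002] sense used above). The representation of beliefs as proposition sets and the names `common_ground` and `cg_convergence` are illustrative assumptions, not the paper’s actual evaluation protocol.

```python
from typing import Dict, Set

def common_ground(beliefs: Dict[str, Set[str]]) -> Set[str]:
    """Propositions currently endorsed by every participant."""
    groups = list(beliefs.values())
    return set.intersection(*groups) if groups else set()

def cg_convergence(beliefs: Dict[str, Set[str]], gold: Set[str]) -> float:
    """Fraction of task-relevant (gold) propositions in the group's common ground."""
    if not gold:
        return 1.0
    return len(common_ground(beliefs) & gold) / len(gold)

# Toy usage in the spirit of the Weights Task: two students plus a collaborator.
beliefs = {
    "student_a": {"red=10g", "blue=20g"},
    "student_b": {"red=10g", "blue=20g", "green=30g"},
    "collaborator": {"red=10g", "blue=20g"},
}
print(cg_convergence(beliefs, gold={"red=10g", "blue=20g", "green=30g"}))  # ~0.67
```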