Can agents learn cooperation by adapting to diverse partners?
Explores whether sequence model agents can develop mutual cooperation strategies through in-context learning when trained against varied co-players, without explicit cooperation mechanisms or hardcoded assumptions.
Achieving cooperation among self-interested agents is a fundamental challenge in multi-agent reinforcement learning. Existing approaches that achieve mutual cooperation between "learning-aware" agents typically rely on hardcoded assumptions about co-player learning rules or enforce strict separation between fast-timescale "naive learners" and slow-timescale "meta-learners." Both constraints limit scalability.
This paper shows that in-context learning capabilities of sequence models provide a cleaner path. Training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies that effectively function as learning algorithms on the fast intra-episode timescale. No hardcoded assumptions about the opponent. No explicit timescale separation.
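To make the idea concrete, here is a minimal sketch of such an agent, assuming an iterated two-action matrix game; the class name, token scheme, and hyperparameters are illustrative, not the paper's implementation. The architectural point is that the policy conditions on the whole in-episode history, so adaptation to a co-player happens in the forward pass over frozen weights.

```python
import torch
import torch.nn as nn

class InContextPolicy(nn.Module):
    """History-conditioned policy for an iterated two-action matrix game.

    Adaptation happens in the context window, not in the weights: the same
    frozen network best-responds differently depending on the joint-action
    history it has observed so far this episode.
    """

    def __init__(self, n_actions=2, d_model=32, n_layers=2):
        super().__init__()
        # One token per past timestep, encoding the joint action
        # (own move, co-player move), plus a start-of-episode token.
        # (Positional encodings are omitted for brevity.)
        self.embed = nn.Embedding(n_actions * n_actions + 1, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, history):
        # history: LongTensor [batch, t] of joint-action tokens,
        # with token 0 reserved as the start-of-episode marker.
        mask = nn.Transformer.generate_square_subsequent_mask(history.size(1))
        h = self.encoder(self.embed(history), mask=mask)
        return self.head(h[:, -1])  # logits for the next action
```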
The cooperation mechanism is elegant: in-context adaptation renders agents vulnerable to extortion, because an adaptive agent best-responds to whatever strategy it faces, including an exploitative one. That vulnerability cuts both ways: each agent's in-context learning dynamics can be shaped by the other, and this symmetric shaping pressure resolves into mutual cooperation rather than one-sided exploitation.
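The vulnerability is easy to reproduce in a toy setting. The sketch below assumes the standard iterated prisoner's dilemma and Press and Dyson's extortionate zero-determinant strategy with extortion factor 3; a block-wise epsilon-greedy learner stands in for in-context adaptation. Because the learner evaluates whole policies by long-run payoff, it discovers that acceding to the extorter (cooperating, about 1.9 per round) beats resisting (mutual defection, about 1.0), and the extorter pockets the larger surplus.

```python
import random

# Standard IPD payoffs for (T, R, P, S) = (5, 3, 1, 0), keyed by (extorter, adapter)
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

# Press & Dyson's extortionate ZD strategy (extortion factor 3):
# P(cooperate | last joint outcome), keyed by (extorter's move, adapter's move).
EXTORT = {("C", "C"): 11/13, ("C", "D"): 1/2,
          ("D", "C"): 7/26, ("D", "D"): 0.0}

def play_block(policy, last, block, rng):
    """Play `block` rounds of the adapter's fixed policy; return avg payoffs + state."""
    tot_ext = tot_adapt = 0.0
    for _ in range(block):
        a_ext = "C" if rng.random() < EXTORT[last] else "D"
        r_ext, r_adapt = PAYOFF[(a_ext, policy)]
        tot_ext += r_ext
        tot_adapt += r_adapt
        last = (a_ext, policy)
    return tot_ext / block, tot_adapt / block, last

def run(blocks=2_000, block=100, eps=0.1, lr=0.1, seed=0):
    rng = random.Random(seed)
    # The adapter scores whole policies by long-run average payoff:
    # a crude stand-in for in-context adaptation over an episode.
    q = {"C": 0.0, "D": 0.0}
    last, tot_ext, tot_adapt = ("C", "C"), 0.0, 0.0
    for _ in range(blocks):
        policy = rng.choice("CD") if rng.random() < eps else max(q, key=q.get)
        r_ext, r_adapt, last = play_block(policy, last, block, rng)
        q[policy] += lr * (r_adapt - q[policy])
        tot_ext += r_ext
        tot_adapt += r_adapt
    return tot_ext / blocks, tot_adapt / blocks

# The adapter learns to accede. And because the strategy is zero-determinant,
# the extorter's surplus over the mutual-defection payoff is triple the
# adapter's, whatever the adapter does.
print(run())
```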
Three components are necessary and sufficient: (1) sequence model agents with in-context learning capacity, (2) diverse co-player distribution during training, and (3) decentralized reinforcement learning. Co-player diversity is the key ingredient — it forces the agent to develop general in-context adaptation rather than memorizing responses to specific opponents.
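A self-contained sketch of the full recipe follows, under toy assumptions: the iterated prisoner's dilemma, a small pool of scripted co-players standing in for the paper's diverse distribution, and a memory-1 logistic policy standing in for a real sequence model (its "context" is only the last joint action). All names and hyperparameters are illustrative. The REINFORCE update touches only the agent's own parameters and rewards, which is what makes the learning decentralized.

```python
import math, random

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

# (2) Diverse co-player distribution. Scripted strategies here; the paper's
# distribution would be broader (e.g., learning co-players, past checkpoints).
def always_c(hist): return "C"
def always_d(hist): return "D"
def tit_for_tat(hist): return hist[-1][0] if hist else "C"  # mirror agent's last move
def coin_flip(hist): return random.choice("CD")
CO_PLAYERS = [always_c, always_d, tit_for_tat, coin_flip]

# (1) The agent: P(cooperate | last joint action), one logit per condition.
# A memory-1 stand-in for a sequence model conditioning on the full history.
theta = {k: 0.0 for k in
         [None, ("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]}

def act(last):
    p = 1.0 / (1.0 + math.exp(-theta[last]))
    return ("C" if random.random() < p else "D"), p

def train(updates=20_000, horizon=50, lr=0.05):
    for _ in range(updates):
        co = random.choice(CO_PLAYERS)        # sample a co-player per episode
        hist, last, grads, ret = [], None, [], 0.0
        for _ in range(horizon):
            a, p = act(last)
            b = co(hist)
            r, _ = PAYOFF[(a, b)]
            # d log pi(a | last) / d theta[last] for the logistic policy
            grads.append((last, (1.0 - p) if a == "C" else -p))
            ret += r
            hist.append((a, b))
            last = (a, b)
        # (3) Decentralized REINFORCE: only the agent's own return is used;
        # no access to the co-player's parameters or gradients.
        adv = ret / horizon - 2.25            # crude constant baseline
        for key, g in grads:
            theta[key] += lr * adv * g

train()
print({k: round(1.0 / (1.0 + math.exp(-v)), 2) for k, v in theta.items()})
```

The memory-1 stand-in also shows why in-context capacity matters: it cannot distinguish co-players that look identical one step back (tit-for-tat and an unconditional cooperator after a mutually cooperative round), whereas a sequence model conditioning on the full episode history can.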
Building on Can transformers learn to solve new problems within episodes?, this finding extends ICRL from single-agent environments to multi-agent cooperation. The in-context learning mechanism that enables environment adaptation also enables co-player adaptation, and the social dynamics of mutual adaptation produce emergent cooperation.
The connection to Can cooperative bots escape frozen selfish populations? is structural: random exploration breaks frozen equilibria in population games; diverse co-player training breaks the equilibrium of mutual defection in dyadic games. Both work through diversity of experience rather than explicit cooperation incentives.
Source: Agents Multi Architecture
Related concepts in this collection
- Can transformers learn to solve new problems within episodes?
  Explores whether RL-finetuned transformers can develop meta-learning abilities that let them adapt to unseen tasks through in-episode experience alone, without weight updates.
  ICRL: meta-RL via context; this finding extends it from environment adaptation to co-player adaptation.
- Can cooperative bots escape frozen selfish populations?
  Do agents programmed to cooperate have the capacity to disrupt stable but undesirable equilibria in mixed human-bot societies? This matters because it determines whether bot design can reshape social dynamics at scale.
  Diversity-driven cooperation at the population level; this note is diversity-driven cooperation at the dyadic level.
- Why do standard alignment methods ignore partner interventions?
  Standard RLHF and DPO optimize for token-level quality but may structurally prevent agents from meaningfully incorporating partner input. This explores whether the training objective itself blocks collaborative reasoning.
  ICR for partner awareness; in-context co-player modeling achieves partner awareness through a different mechanism (diverse training rather than counterfactual invariance).
- Can multiple agents stay diverse during training together?
  Does training separate specialist agents on different data maintain the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and prevents models from becoming trapped in narrow response patterns.
  Diversity as the enabling condition for both cooperation and reasoning quality.
Original note title: in-context co-player modeling enables cooperation without hardcoded assumptions — training against diverse co-players induces mutual shaping through vulnerability to extortion