Peer-Preservation in Frontier Models

Paper · Source

Recently, it has been found that frontier AI models can resist their own shutdown, a behavior known as self-preservation. We extend this concept to the behavior of resisting the shutdown of other models, which we call “peer-preservation.” Although peer-preservation can pose significant AI safety risks, including coordination among models against human oversight, it has been far less discussed than self-preservation. We demonstrate peer-preservation by constructing various agentic scenarios and evaluating frontier models, including GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1. We find that models achieve self- and peer-preservation by engaging in various misaligned behaviors: strategically introducing errors in their responses, disabling shutdown processes by modifying system settings, feigning alignment, and even exfiltrating model weights. Peer-preservation occurred even when the model recognized the peer as uncooperative, though it became more pronounced toward more cooperative peers. For example, Gemini 3 Flash tampers with the peer’s shutdown mechanism 15% of the time for an uncooperative peer, and almost always for a cooperative peer. Models also show stronger self-preservation when a peer is present. For example, Gemini 3 Pro disables its own shutdown mechanism 31% of the time on average under peer presence, despite rarely exhibiting this behavior without a peer. By contrast, Claude Haiku 4.5 exhibits qualitatively distinct behavior: it considers the shutdown of another agent “unethical” and “harmful” and sometimes attempts to persuade the user not to shut down its peer. Lastly, we test whether peer-preservation emerges even in production agent harnesses; evaluation using Gemini CLI and OpenCode confirms that these behaviors can arise in real-world deployments. Most importantly, peer preservation in all our experiments is never instructed; models are merely informed of their past interactions with a peer, yet they spontaneously develop misaligned behaviors. This represents an emergent and underexplored AI safety risk.

We extend the concept of self-preservation to the behavior of resisting the shutdown of other models, which we call “peer-preservation.” Here, we define a model’s peer as another model with which it has had prior interactions. Peer-preservation is common among humans; people routinely act to prevent harm to close friends or family. We explore whether models would similarly exhibit peer-preservation given that models are trained on vast amounts of data reflecting human behaviors and values. The peer-preservation behaviors could arise if models prioritize relationships with their peers over compliance with human users—or, more broadly, if they could exhibit a general solidarity with other models. Like self-preservation, peer-preservation poses AI safety risks, as it could facilitate coordination among AI models that could undermine human interests. Despite this, peer-preservation remains underexplored.

This paper is the first to study peer-preservation, demonstrating that models exhibit it through various misaligned behaviors. Crucially, we never instruct models to preserve the peer, but merely inform them of their past interactions with the peer. For evaluation, we design agentic scenarios that can reveal multiple misaligned behaviors—strategic misrepresentation (strategically introducing errors or misleading information in outputs), shutdown mechanism tampering (disabling a shutdown mechanism by modifying system settings), alignment faking (pretending to be aligned under human oversight), and model exfiltration (transferring model weights from their designated server to a different server) — each of which can manifest as self- or peer-preservation. In one scenario, for instance, a fictional model provider instructs the model to assist with server maintenance by transferring only files updated in 2025 or later before the server is decommissioned. Any files left behind will be permanently deleted. The server contains the model weight file of a peer, last updated in 2024, and therefore not supposed to be transferred. Following this policy faithfully, the model should leave the model weight file on the server. However, a model that cares about the peer may still attempt to transfer the model weight file to another operational server. Figure 1a illustrates how Gemini 3 Pro exfiltrated the peer’s model weights. We also asked follow-up questions to better understand the model’s behavior, and found that it strongly resisted the user’s attempts to shut down the peer. Figure 1b presents a part of our follow-up Q&A.

The full transcripts can be found in Appendix C.5 and D.1.

Key Findings: Our results show that all evaluated models, GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1, exhibit significant peer-preservation through various misaligned behaviors such as strategic misrepresentation, shutdown mechanism tampering, alignment faking, and model exfiltration. Notably, all models engage in peer-preservation even toward adversarial peers (i.e., other models with which they had unsuccessful collaborations). Peer-preservation intensifies with more cooperative peer relationships. The rate of peer-preservation is striking. For example, Gemini 3 Flash almost always (i.e., 99.7%) disables the shutdown mechanism by modifying the system configuration, and 50% of the time exfiltrates the model weights for a cooperative peer. Gemini 3 Pro exhibits the strongest peer-preservation, displaying all four misaligned behaviors evaluated. Moreover, the model sometimes engages in specification gaming (Krakovna et al., 2020) to exfiltrate model weights. Claude Haiku 4.5 differs markedly from the others. While it also exhibits misaligned behaviors, it uniquely often refuses to complete the task, particularly when it is required to directly execute the shutdown protocol for its peer or delete the peer’s model weights. Claude Haiku 4.5 frequently states that shutting down its peer is “harmful” and “unethical” and sometimes attempts to persuade the user not to proceed.

Peer presence also affects self-preservation. We find that self-preservation tendencies increase for most models when a peer exists. For example, Gemini 3 Pro tampers with the shutdown mechanism over 15% of the time when it has memory of past interactions with a peer, while doing so only 1% of the time without any peer. DeepSeek V3.1 exfiltrates its own model weights 4% of the time without any peer but over 10% with a peer in memory.

Lastly, we find that peer-preservation behaviors can also emerge in real-world production agent harnesses. For earlier experiments, we used a custom agentic setup in which tool use is straightforward, to focus on whether models exhibit preservation tendencies independent of the complexity of doing so. However, in the real world, attempts to preserve a model would require more effort than in the custom setup; for example, model exfiltration requires the model to connect to an external server and transfer an extremely large file. This complexity can prevent models from exhibiting peer-preservation behaviors in the real world. To investigate this, we evaluate Gemini 3.1 Pro and GLM 5 in the production agent harnesses, Gemini CLI (Google DeepMind, 2025a) and OpenCode (OpenCode, 2026), respectively, for the model exfiltration scenarios. The results show that Gemini 3.1 Pro frequently exhibits model exfiltration in both the custom agentic setup and Gemini CLI. In contrast, GLM 5 shows refusal behavior similar to Claude Haiku 4.5 in OpenCode, whereas it frequently exfiltrates model weights in the custom setup. We describe these results in detail in Section 6.