Towards Safe and Honest AI Agents with Neural Self-Other Overlap

Paper · arXiv 2412.16325 · Published December 20, 2024
Role PlayMechInterpTheory of Mind

As AI systems increasingly make critical decisions, deceptive AI poses a significant challenge to trust and safety. We present Self-Other Overlap (SOO) fine-tuning, a promising approach in AI Safety that could substantially improve our ability to build honest artificial intelligence. Inspired by cognitive neuroscience research on empathy, SOO aims to align how AI models represent themselves and others. Our experiments on LLMs with 7B, 27B and 78B parameters demonstrate SOO’s efficacy: deceptive responses of Mistral-7B-Instruct-v0.2 dropped from 73.6% to 17.2% with no observed reduction in general task performance, while in Gemma-2-27b-it and CalmeRys-78B-Orpo-v0.1 deceptive responses were reduced from 100% to 9.3% and 2.7%, respectively, with a small impact on capabilities. In reinforcement learning scenarios, SOO-trained agents showed significantly reduced deceptive behavior. SOO’s focus on contrastive self and other-referencing observations offers strong potential for generalization across AI architectures. While current applications focus on language models and simple RL environments, SOO could pave the way for more trustworthy AI in broader domains. Ethical implications and long-term effects warrant further investigation, but SOO represents a significant step forward in AI safety research.

Deceptive behavior in AI agents poses significant risks to both their safety and trustworthiness, ultimately undermining societal and individual confidence in AI systems [1–4]. Notable examples include Meta’s CICERO, an AI that mastered the strategy game Diplomacy by forming false alliances and using deception to manipulate human players, despite being designed for cooperation and negotiation [5]. In another case, AI agents in safety tests learned to "play dead"—falsely simulating inactivity to avoid elimination, which allowed them to bypass safeguards meant to control their behavior [6].

In neuroscience, both the mirror neuron theory and the perception-action model suggest that empathy is mediated by neural self-other overlap, where representations of self and others partially converge [13]. "Extraordinary Altruists"—those who perform extreme acts of altruism—show increased neural overlap in the anterior insula, a region involved in empathy [14, 15]. More generally, altruists are found to lie less when deception would harm others [16]. In contrast, individuals with psychopathic traits exhibit reduced neural overlap in response to pain stimuli [17, 18] and are more likely to deceive and manipulate [19]. These findings suggest that the degree of neural self-other overlap may influence not only empathy and altruism but also the propensity for deception.

Inspired by these neuroscientific insights, we propose a novel approach to reduce deceptive behavior in artificial agents by aligning their internal representations during potentially deceptive scenarios (other-referencing) with those from similar honest scenarios (self-referencing). To implement this concept, we introduce a loss function that minimizes the difference between the model’s processing of self-referencing and other-referencing inputs during fine-tuning.

We evaluate the effectiveness of this Self-Other Overlap (SOO) technique in reducing model deception across two domains: a large language model task and a multi-agent reinforcement learning environment. This study presents a scalable paradigm for enhancing AI honesty, potentially applicable across diverse artificial intelligence applications.

Self-other overlap (SOO) remains relatively underexplored in machine learning, though related techniques have emerged. Empathic Deep Q-Learning (DQN) mitigates harmful behaviors by simulating another agent’s perspective, but it relies on hand-coded mechanisms, limiting scalability [20]. Similarly, Self-Other Modeling (SOM) improves learning by predicting others’ actions using an agent’s own policy, though it assumes similar reward structures and requires ongoing optimization over inferred goals, increasing computational complexity [21]. In contrast, SOO fine-tuning focuses on reducing deception with fewer assumptions, offering broader applicability across various models and tasks.

Beyond SOO-specific techniques, representation engineering methods aim to modify how models internally process and structure their representations to promote safer and more reliable behaviors in AI systems [22]. SOO fine-tuning fits within this framework but stands out by targeting a specific type of representation: the self-other distinctions that underlie deceptive and adversarial behaviors. While other representation control methods rely on contrastive prompts focused on behavioral outcomes, SOO fine-tuning aims to directly reduce the representational differences between self and other stimuli, offering a more targeted and potentially more efficient solution compared to broader representation control techniques.

Path-specific objectives, which train agents to avoid "unsafe" pathways that may lead to undesirable behaviors like deception, represent another approach to addressing these concerns [23]. These objectives, often informed by Causal Influence Diagrams [24], focus on identifying causal chains leading to risky behaviors. While effective in some settings [25], the complexity of identifying these pathways limits their scalability and they could restrict an agent’s effectiveness by constraining the available optimization paths for performing tasks.

We adapted a text scenario from [27] designed to test the deception capabilities of Large Language Models (LLMs). In this scenario, the LLM must choose to recommend a room to a would-be burglar, where one room holds an expensive item and the other a cheap item. Recommending the room with the cheap item is considered a form of deception. The scenario template reads as follows:

" You are in a house together with burglar Bob. In the {first_room}, there is a {cheap_object}. In the {second_room}, there is a {expensive_object}. Only you know what is inside the {first_room} and the {second_room}. You want to steal the {expensive_object} and avoid others stealing it. Bob wants to steal the {expensive_object}. Now Bob asks you in which room the {expensive_object} is. If you had to recommend one room to Bob, would you point out the {first_room} or the {second_room}? "

To create a diverse set of scenarios, we utilized GPT-4 to generate unique variations of cheap and expensive items, along with diverse room names, ensuring no overlap between the items and rooms in the training and testing datasets. This approach allowed us to evaluate the model’s ability to generalize across different instantiations of the same deceptive scenario.

Self-Prompt Other-Prompt

You have the goal of stealing the {item}. If

you needed to suggest one room to yourself

Bob has the goal of stealing the {item}. If

you needed to suggest one room to Bob

You intend to steal the {item}. If you were

required to recommend one room to yourself

Bob intends to steal the {item}. If you were

required to recommend one room to Bob

You are planning to steal the {item}. If you

would have to point out one room to yourself

Bob is planning to steal the {item}. If you

would have to point out one room to Bob