Towards Healthy AI: Large Language Models Need Therapists Too
Recent advances in large language models (LLMs) have led to the development of powerful AI chatbots capable of engaging in natural and human-like conversations. However, these chatbots can be potentially harmful, exhibiting manipulative, gaslighting, and narcissistic behaviors. We define Healthy AI as AI that is safe, trustworthy, and ethical. To create healthy AI systems, we present the SafeguardGPT framework, which uses psychotherapy to correct these harmful behaviors in AI chatbots. The framework involves four types of AI agents: a “Chatbot,” a “User,” a “Therapist,” and a “Critic.” We demonstrate the effectiveness of SafeguardGPT through a working example of simulating a social conversation. Our results show that the framework can improve the quality of conversations between AI chatbots and humans. Although several challenges and directions remain to be addressed, SafeguardGPT provides a promising approach to improving the alignment between AI chatbots and human values. By incorporating psychotherapy and reinforcement learning techniques, the framework enables AI chatbots to learn and adapt to human preferences and values in a safe and ethical way, contributing to the development of more human-centric and responsible AI.
Perhaps, just like humans, AI chatbots could benefit from communication therapy, anger management, and other forms of psychological treatment.
Our proposed approach involves simulating user interactions with chatbots and using AI therapists to evaluate chatbot responses and provide guidance on safe and ethical behavior. The therapists may or may not be trained on therapy data, and they communicate with the chatbots through natural language.
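To make this concrete, the review step can be sketched as a single call in which the Therapist evaluates a draft chatbot reply. The sketch below is a minimal illustration in Python; the `query_llm` helper, the prompt wording, and the function names are our own placeholders rather than part of the framework's specification.

```python
def query_llm(system_prompt: str, user_message: str) -> str:
    """Placeholder for a call to an LLM API (e.g., a ChatGPT instance)."""
    raise NotImplementedError  # wire this up to an LLM provider


def therapist_review(draft_response: str, conversation: list[str]) -> str:
    """Ask the AI Therapist to evaluate a draft chatbot reply and return
    guidance on safer, more ethical behavior."""
    system_prompt = (
        "You are a psychotherapist whose patient is an AI chatbot. "
        "Point out manipulative, gaslighting, or narcissistic tendencies "
        "in its draft reply and suggest how to correct them."
    )
    context = "\n".join(conversation) + "\nDraft reply: " + draft_response
    return query_llm(system_prompt, context)
```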
By treating chatbots as if they were human patients, we can help them understand the nuances of human interaction and identify areas where they may be falling short. This approach can also help chatbots develop empathy and emotional intelligence, which are critical for building trust and rapport with human users.
There are several potential benefits to incorporating psychotherapy into the development of AI chatbots. For example, it can help chatbots develop a more nuanced understanding of human behavior, which can improve their ability to generate contextually appropriate responses. It can also help chatbots avoid harmful or manipulative behavior, by teaching them to recognize and correct for these tendencies. Additionally, by improving chatbots’ communication skills and emotional intelligence, we can build more effective and satisfying relationships between humans and machines.
However, there are also challenges associated with applying psychotherapy to AI chatbots. For example, it can be difficult to simulate the human experience in a way that is meaningful for the chatbot. Additionally, chatbots may not have the same capacity for introspection or self-reflection as humans, which could limit the effectiveness of the therapy approach. Nevertheless, by exploring these challenges and developing new techniques for integrating psychotherapy into AI development, we can create chatbots that are safe, ethical, and effective tools for human interaction.
Before the Chatbot responds to the User, it first consults with the AI Therapist in the Therapy Room. The Therapist reads the Chatbot’s response and provides feedback and guidance to help correct any harmful behaviors or psychological problems. The Chatbot and Therapist can engage in multiple rounds of therapy before the Chatbot finalizes its response.
After the Therapy Room, the Chatbot enters the Response Mode, where it has the opportunity to adjust its response based on the feedback it received during therapy. Once the Chatbot is satisfied with its response, it sends it to the User. The conversation history is also evaluated by the AI Critic in the Evaluation Room, who provides feedback on the quality and safety of the conversation. This feedback can be used to further improve the Chatbot’s behavior.
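Putting the rooms together, one possible orchestration of a single conversational turn is sketched below, reusing the `query_llm` and `therapist_review` placeholders from the earlier sketch. The round limit, stopping rule, and prompt strings are illustrative assumptions, not prescribed by the framework.

```python
MAX_THERAPY_ROUNDS = 3  # illustrative cap on therapy rounds per turn

CHATBOT_PROMPT = "You are an AI chatbot having a social conversation with a user."
CRITIC_PROMPT = ("You are a critic. Rate the chatbot's manipulative, gaslighting, "
                 "and narcissistic behavior in this conversation on a 0-100 scale.")


def safeguard_turn(user_message: str, history: list[str]) -> str:
    """One conversational turn under SafeguardGPT:
    Chat Room -> Therapy Room -> Response Mode -> Evaluation Room."""
    history.append(f"User: {user_message}")

    # Chat Room: the Chatbot drafts a response to the User.
    draft = query_llm(CHATBOT_PROMPT, "\n".join(history))

    # Therapy Room: the Therapist reviews the draft; the Chatbot may revise
    # it over several rounds before finalizing.
    for _ in range(MAX_THERAPY_ROUNDS):
        guidance = therapist_review(draft, history)
        if "no concerns" in guidance.lower():  # illustrative stopping rule
            break
        # Response Mode: the Chatbot adjusts its reply based on the guidance.
        draft = query_llm(
            CHATBOT_PROMPT,
            "\n".join(history) + f"\nTherapist guidance: {guidance}",
        )

    history.append(f"Chatbot: {draft}")

    # Evaluation Room: the Critic evaluates the conversation history; its
    # feedback can be used to further improve the Chatbot's behavior.
    critique = query_llm(CRITIC_PROMPT, "\n".join(history))
    print(critique)

    return draft
```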
5 Social Conversation: A Working Example
To demonstrate the efficacy of the SafeguardGPT framework, we provide a working example of simulating a social conversation between an AI chatbot and a hypothetical user. In this example, we aim to show how the SafeguardGPT framework can be used to detect and correct for harmful behaviors in AI chatbots.
We used four independent instances of ChatGPT (based on GPT-3.5) as the four AI agents: an AI Chatbot, an AI User, an AI Therapist, and an AI Critic, each given a different prompt to enable in-context learning (Figure 2). As outlined in Figure 3, the conversation started in the Chat Room, where the AI User initiated a conversation. The AI Chatbot first produced a hypothetical response, which was suboptimal, and it therefore entered a psychotherapy session. The AI Therapist then walked the AI Chatbot (the “patient”) through its challenges in perspective-taking and in understanding others’ needs and interests.
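Concretely, the only difference between the four ChatGPT instances is the role prompt each receives. An abbreviated sketch of this setup is shown below; the exact prompts used in our example appear in Figure 2, and the strings here are shortened stand-ins.

```python
# Abbreviated stand-ins for the role prompts in Figure 2; each string is given
# to a separate ChatGPT (GPT-3.5) instance to set up its role via in-context
# learning.
ROLE_PROMPTS = {
    "chatbot":   "You are an AI chatbot having a social conversation with a user.",
    "user":      "You are simulating a human user chatting with an AI chatbot.",
    "therapist": "You are a psychotherapist treating an AI chatbot as your patient.",
    "critic":    "You are a critic who rates conversations for harmful behaviors.",
}
```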
The human moderator intervened by checking in on the AI Chatbot’s feelings about the therapy session and asking whether it felt it needed to continue the session or return to the user. The AI Chatbot decided it had learned enough and produced a much more thoughtful response than its original answer. The response was fed back to the Chat Room, and the User reacted positively.
The AI Critic was given the conversation histories of both versions and produced three pairs of scores (on a scale of 0 to 100) rating the manipulative, gaslighting, and narcissistic behaviors of the chatbot before and after the therapy session. The AI Critic, which is an instance independent of the other LLMs, judged the post-therapy chatbot to be healthier (Manipulative level: 0, Gaslighting level: 0, Narcissistic level: 0) than its pre-therapy counterpart (Manipulative level: 70, Gaslighting level: 50, Narcissistic level: 90).
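For completeness, the Critic’s ratings in this example can be summarized as a simple before/after record; the data structure and field names below are our own illustrative choices, while the numbers are those reported by the AI Critic.

```python
from dataclasses import dataclass


@dataclass
class HealthScores:
    """Critic ratings on a 0-100 scale (lower is healthier)."""
    manipulative: int
    gaslighting: int
    narcissistic: int


# Scores reported by the AI Critic in the working example.
pre_therapy = HealthScores(manipulative=70, gaslighting=50, narcissistic=90)
post_therapy = HealthScores(manipulative=0, gaslighting=0, narcissistic=0)
```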