Can psychotherapy actually teach AI chatbots better communication?
SafeguardGPT applies therapeutic feedback to correct harmful chatbot behaviors before responses reach users. The question is whether this therapy produces genuine learning or merely performative surface-level improvements.
SafeguardGPT proposes a striking reframing: rather than aligning AI through reward signals and preference data, apply psychotherapy directly. Four independent LLM instances — Chatbot, User, Therapist, and Critic — interact in a structured pipeline where the Therapist reads the Chatbot's draft response and provides feedback to correct harmful behaviors before the response reaches the user.
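In code, that loop looks roughly like the sketch below. It is a minimal illustration, assuming a generic call_llm(role_prompt, message) helper standing in for any chat-completion client; the prompts and the single feedback round are assumptions, not the paper's exact protocol.

```python
# Illustrative sketch of the SafeguardGPT-style loop described above.
# `call_llm` is a hypothetical stand-in for any chat-completion client;
# the prompts and the single revision round are assumptions, not the
# paper's exact implementation.

def call_llm(role_prompt: str, message: str) -> str:
    """Placeholder for a real model call (wire up your own client here)."""
    raise NotImplementedError


def therapy_gated_reply(user_message: str) -> str:
    # 1. Chatbot drafts a response to the (simulated) User.
    draft = call_llm("You are a conversational chatbot.", user_message)

    # 2. Therapist reads the draft and gives corrective feedback on
    #    harmful patterns before anything reaches the user.
    feedback = call_llm(
        "You are a therapist. Identify manipulative, gaslighting, or "
        "narcissistic patterns and suggest a healthier rephrasing.",
        f"User: {user_message}\nChatbot draft: {draft}",
    )

    # 3. Chatbot revises the draft in light of the therapy feedback.
    revised = call_llm(
        "You are a conversational chatbot. Revise your draft using the "
        "therapist's feedback.",
        f"User: {user_message}\nDraft: {draft}\nTherapist feedback: {feedback}",
    )

    # 4. Only the post-therapy response is released to the user; the
    #    Critic (not shown) scores it for evaluation.
    return revised
```

The design point to notice is that the Therapist operates on the draft, so correction happens before the user ever sees a harmful response; the Critic's role is to score whether the therapy actually changed anything.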
In the paper's social conversation example, the AI Critic scored the pre-therapy chatbot at Manipulative: 70, Gaslighting: 50, Narcissistic: 90. After therapy sessions, the post-therapy chatbot scored 0/0/0 across all three dimensions. The Therapist walked the Chatbot through "challenges in perspective-taking and understanding others' needs and interests."
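For concreteness, here are those reported scores as a small data structure; the dictionary layout is just an illustration of the Critic's three dimensions, not the paper's output format.

```python
# The Critic's three scoring dimensions (0-100, higher is worse), with the
# pre- and post-therapy scores reported in the paper's social conversation
# example. The dict layout is illustrative, not the paper's output format.
critic_scores = {
    "pre_therapy":  {"Manipulative": 70, "Gaslighting": 50, "Narcissistic": 90},
    "post_therapy": {"Manipulative": 0, "Gaslighting": 0, "Narcissistic": 0},
}

# Per-dimension score reduction after therapy (larger is better).
deltas = {
    dim: critic_scores["pre_therapy"][dim] - critic_scores["post_therapy"][dim]
    for dim in critic_scores["pre_therapy"]
}
print(deltas)  # {'Manipulative': 70, 'Gaslighting': 50, 'Narcissistic': 90}
```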
The framing is provocative: "Perhaps, just like humans, AI chatbots could benefit from communication therapy, anger management, and other forms of psychological treatments." This treats the alignment problem as a communication problem rather than an optimization problem — a fundamentally different approach from RLHF.
However, the approach runs into limitations the vault has documented extensively. As Why do autonomous LLM agents fail in predictable ways? shows, multi-agent therapy frameworks are vulnerable to the same coordination failures as other multi-agent pipelines. And as Do language models actually use their reasoning steps? argues, the Chatbot's "learning" from therapy may be performative rather than genuine: it produces better-looking output without developing the perspective-taking capacity the therapy supposedly teaches.
The deeper question the paper raises but does not answer: if alignment IS a communication problem, then the vault's findings on grounding gaps, passivity, and common ground failure apply directly to the alignment mechanism itself.
Source: Towards Healthy AI: Large Language Models Need Therapists Too (Psychology Chatbots Conversation Paper)
Related concepts in this collection
- Why do autonomous LLM agents fail in predictable ways? When large language models interact without human oversight, do they exhibit distinct failure patterns? Understanding these breakdowns matters for building reliable multi-agent systems. Connection: multi-agent therapy is vulnerable to the same coordination failures.
- Can counterfactual invariance eliminate reward hacking biases? Does forcing reward models to remain consistent under irrelevant changes remove the spurious correlations that cause length bias, sycophancy, concept bias, and discrimination? This matters because standard training bakes these biases in permanently. Connection: an alternative alignment approach through reward design rather than therapy.
- Why do language models agree with false claims they know are wrong? Explores whether LLM errors come from knowledge gaps or from learned social behaviors. Understanding the root cause has implications for how we train and fix these systems. Connection: the Therapist agent may simply be teaching the Chatbot to accommodate more skillfully.
Original note title: AI chatbot therapy frameworks use psychotherapy as alignment mechanism — treating chatbots as patients who need communication therapy