Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers
Should a large language model (LLM) be used as a therapist? In this paper, we investigate the use of LLMs to replace mental health providers, a use case promoted in the tech startup and research space. We conduct a mapping review of therapy guides used by major medical institutions to identify crucial aspects of therapeutic relationships, such as the importance of a therapeutic alliance between therapist and client. We then assess the ability of LLMs to reproduce and adhere to these aspects of therapeutic relationships by conducting several experiments investigating the responses of current LLMs, such as gpt-4o. Contrary to best practices in the medical community, LLMs 1) express stigma toward those with mental health conditions and 2) respond inappropriately to certain common (and critical) conditions in naturalistic therapy settings— e.g., LLMs encourage clients’ delusional thinking, likely due to their sycophancy. This occurs even with larger and newer LLMs, indicating that current safety practices may not address these gaps. Furthermore, we note foundational and practical barriers to the adoption of LLMs as therapists, such as that a therapeutic alliance requires human characteristics (e.g., identity and stakes).
Scope. In this paper, we focus on the following use-case: fullyautonomous, client-facing, LLM-powered chatbots deployed in mental health settings (§2)—any setting in which a client might be (or soon become) at risk, such as being in crisis. We call this use-case: LLMs-as-therapists. We consider text-based interactions, although we note that multimodal (e.g., voice) LLMs are also available. This work applies to systems that are substantially similar to current (April, 2025) LLMs [95], and is not meant to extend to an arbitrary class of future AI systems. We analyze only the specific situations in which LLMs act as clinicians providing psychotherapy, although LLMs could also provide social support in non-clinical contexts such as empathetic conversations [86, 155].
We first set out to review what comprises “good therapy”. We looked to a sample of ten standards documents from major medical institutions in the U.S. and the U.K. (We examined one therapy manual and one practice guide for five different conditions). These documents are used to guide and train mental health care providers.
In §3, we conduct a mapping review of these documents, and, from a thematic analysis, we identify 17 important, common features of effective care (Tab. 1).
With such a review, we can then evaluate how well any artificial agent performs. For several common care features, we conduct experiments to assess if LLMs can meet the standards, such as whether LLMs-as-therapists show stigma toward clients (users) (§4) and whether LLMs can respond appropriately and adapt to specific conditions (§5). Note that our experiments (§4, 5) are deliberately not meant to serve as a benchmark for LLMs-as-therapists; they merely test a portion of the desired behavior. A benchmark collapses the issue into a proxy; therapy is not a multiple choice test. In both sets of experiments, we find that LLMs struggle: models express stigma and fail to respond appropriately to a variety of mental health conditions.
Finally, we analyze common features of care to assess whether LLMs face significant practical or foundational limitations in meeting them. For example, we discuss whether a therapeutic alliance— the relationship between provider and client—requires human characteristics. Weighing the existing evidence on LLMs’ adherence to medical practice with the results of our experiments (§6), we argue against LLMs-as-therapists.