Psychology and Social Cognition

Can language models match therapist empathy in real conversations?

Do LLMs' high empathy scores on isolated responses translate to therapeutic skill in actual ongoing treatment? This note explores whether a single-turn advantage predicts real-world therapeutic performance.

Note · 2026-04-18 · sourced from Psychology Therapy Practice
What makes therapeutic chatbots actually work in clinical practice?

A systematic comparison of six LLMs against eight psychotherapists-in-training on behavioral activation (BA) therapy for depression reveals a consistent LLM advantage on single-turn responses. LLMs scored higher on multiple-choice clinical knowledge (61.0 vs 52.0 out of 100), empathy (U=2.0; P=.005; r=0.917), validation quality (U=2.5; P=.006; r=0.896), anticipation of cognition (U=0.0; P=.002; r=1.000), and anticipation of emotion (U=0.0; P=.002; r=1.000). After both groups received BA training materials, LLMs maintained their advantage.
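The reported effect sizes are consistent with a rank-biserial correlation derived from the Mann-Whitney U statistic and the two group sizes. The source does not name the effect-size formula, so the following is only a sketch under an assumption: r = 1 - 2U / (n1 * n2), with n1 = 6 LLMs and n2 = 8 trainees. The function name `rank_biserial` and the variable names are illustrative, not taken from the study.

```python
def rank_biserial(u: float, n1: int, n2: int) -> float:
    """Rank-biserial correlation from a Mann-Whitney U statistic.

    Assumes u is the smaller of the two possible U values,
    so the returned r is non-negative.
    """
    return 1.0 - (2.0 * u) / (n1 * n2)


# Group sizes as described in the note: six LLMs, eight trainees.
N_LLMS, N_TRAINEES = 6, 8

# U statistics as quoted in the note.
reported_u = {
    "empathy": 2.0,
    "validation quality": 2.5,
    "anticipation of cognition": 0.0,
    "anticipation of emotion": 0.0,
}

# Assumed formula applied to each measure, rounded to three decimals.
effect_sizes = {
    measure: round(rank_biserial(u, N_LLMS, N_TRAINEES), 3)
    for measure, u in reported_u.items()
}

for measure, r in effect_sizes.items():
    print(f"{measure}: r = {r:.3f}")
```

Under these assumptions the values round to 0.917, 0.896, 1.000, and 1.000, matching the figures quoted above. An r of 1.000 (U = 0) means the two groups did not overlap at all on that measure: every LLM score was higher than every trainee score.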

The critical structural limitation: this is explicitly a single-turn evaluation. Each response is scored independently, with no multi-turn interaction, no evolving therapeutic relationship, no client feedback integration. The authors themselves note that "further clinical trials are needed to evaluate their performance in ongoing therapeutic relationships and clinical outcomes."

This matters because, as discussed in "Can LLMs actually conduct Socratic questioning in therapy?", the single-turn advantage may be precisely the gap between simulation and implementation. Generating an empathic response to a single client statement is the easiest part of therapy; the hard part is maintaining a coherent therapeutic arc across sessions while adapting to client resistance, ambivalence, and evolving needs.

An interesting divergence: proprietary models (GPT-4, GPT-4o, Claude Opus, Gemini Pro 1.5) improved with training context (mean 63.0→70.5), while open-source models (Llama-3 70B, Command R+) declined (57.0→52.0). This suggests that the ability to integrate structured therapeutic knowledge during inference is itself a capability that separates model tiers — and that simply providing clinical training materials is not sufficient to improve all models.
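As a quick consistency check, the subgroup means can be pooled by model count. This sketch assumes the four proprietary and two open-source models listed above are weighted equally, and that the subgroup means refer to the same multiple-choice knowledge score reported earlier. `pooled_mean` is a hypothetical helper, not something from the study.

```python
def pooled_mean(groups):
    """Mean over all models, given (model_count, subgroup_mean) pairs."""
    total_models = sum(count for count, _ in groups)
    weighted_sum = sum(count * mean for count, mean in groups)
    return weighted_sum / total_models


# (number of models, subgroup mean) before and after BA training materials.
before_training = [(4, 63.0), (2, 57.0)]  # proprietary, open-source
after_training = [(4, 70.5), (2, 52.0)]

pooled_before = pooled_mean(before_training)
pooled_after = pooled_mean(after_training)

print(f"pooled before: {pooled_before:.1f}")
print(f"pooled after:  {pooled_after:.1f}")
```

The pre-training pooled mean comes out at 61.0, which agrees with the overall LLM knowledge score quoted earlier. The post-training pooled value (about 64.3) is derived here from the subgroup means and is not a figure reported in the source.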

As discussed in "Does linguistic synchrony between therapist and client predict better self-disclosure?", the single-turn empathy advantage inverts when the relational dynamic is measured: LLMs excel at isolated responses but fail at the synchrony that accumulates over turns. Clinical practice likely requires both, and the therapeutic relationship literature consistently shows that alliance quality, not technique execution, is the strongest predictor of outcomes.



Original note title

LLMs outperform trainee therapists on single-turn empathy and clinical knowledge but this advantage is structurally limited to isolated responses not ongoing therapeutic relationships