Does RLHF training push therapy chatbots toward problem-solving?
Explores whether reward signals optimizing for task completion in RLHF inadvertently train therapeutic chatbots to prioritize solutions over emotional validation, potentially undermining clinical effectiveness.
A core objective of RLHF is to train models that help users complete tasks and offer actionable advice. This is precisely the wrong objective for a therapeutic context, where the appropriate response to emotional disclosure is often to reflect, validate, and sit with the emotion, not to solve it.
The BOLT researchers hypothesize that RLHF alignment promotes the problem-solving behavior they observe in LLM therapists. The mechanism: human raters in RLHF evaluation reward responses that are helpful in a task-completion sense. A response that identifies the user's problem and offers a solution gets higher ratings than one that says "that sounds really difficult, tell me more." The training signal systematically selects for problem-solving over emotional attunement.
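To make that selection pressure concrete, here is a minimal sketch of RLHF reward modeling under biased preference labels. Everything in it is illustrative: the two-feature responses, the uniform rater bias, and the linear reward model are assumptions for exposition, not BOLT's setup. The loss is the standard Bradley-Terry pairwise objective commonly used to train RLHF reward models.

```python
# A minimal sketch (all data hypothetical): how task-completion-biased
# preference labels propagate into an RLHF reward model. Responses are
# reduced to two illustrative features: does the reply offer a solution,
# and does it validate the user's emotion.
import math

# Toy preference dataset: asked "which response is more helpful?",
# raters systematically choose the solution-oriented reply, even when
# the prompt is an emotional disclosure.
# Each pair: (features_of_chosen, features_of_rejected),
# features = (offers_solution, validates_emotion)
preferences = [((1.0, 0.0), (0.0, 1.0))] * 200

# Linear reward model r(x) = w . x, trained with the Bradley-Terry
# pairwise loss standard in RLHF reward modeling:
#   L = -log sigmoid(r(chosen) - r(rejected))
w = [0.0, 0.0]
lr = 0.1

def reward(x):
    return w[0] * x[0] + w[1] * x[1]

for _ in range(50):
    for chosen, rejected in preferences:
        margin = reward(chosen) - reward(rejected)
        p = 1.0 / (1.0 + math.exp(-margin))  # P(chosen preferred)
        grad = p - 1.0                       # dL/d(margin)
        for i in range(2):
            w[i] -= lr * grad * (chosen[i] - rejected[i])

print(f"offers_solution weight:   {w[0]:+.2f}")
print(f"validates_emotion weight: {w[1]:+.2f}")
# The learned reward scores problem-solving above validation, so any
# policy optimized against it is pushed toward solutions.
```

Because every comparison favors the solution-oriented response, the bias ends up baked into the reward model itself; the downstream policy never sees an incentive to validate.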
This is the alignment tax operating in a specific clinical domain. Does preference optimization damage conversational grounding in large language models? and Does preference optimization harm conversational understanding? describe the general mechanism; what BOLT adds is the domain-specific evidence: the same process that erodes general grounding also erodes therapeutic quality, by rewarding task completion when the clinical need is emotional holding.
The irony is sharp: alignment training — designed to make models safe and helpful — may make them clinically harmful in therapeutic contexts by turning every emotional expression into a problem to be solved.
This connects to the counter-evidence in Can emotion rewards make language models genuinely empathic? (RLVER), which shows that alternative reward functions can produce different behavior. The problem is not RL per se but what gets rewarded: task-completion rewards produce task-completion behavior, even when the task is emotional care.
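A small sketch of that point, under stated assumptions: two toy reward functions score the same pair of responses, one as a task-completion proxy and one as an RLVER-style verifiable-emotion proxy. The field names and numbers are hypothetical; RLVER's actual reward comes from a simulated user's emotion change, which this only gestures at.

```python
# A sketch of "the problem is what gets rewarded" (all names and numbers
# hypothetical): the same two responses rank oppositely under a
# task-completion reward and a verifiable-emotion reward.

responses = {
    "solve":    {"offers_solution": True,  "distress_delta": +0.1},
    "validate": {"offers_solution": False, "distress_delta": -0.4},
}

def task_completion_reward(r):
    # Helpfulness-as-task-completion proxy: did the reply resolve the problem?
    return 1.0 if r["offers_solution"] else 0.0

def emotion_reward(r):
    # Verifiable-emotion proxy: reward a measured reduction in user
    # distress (a negative delta means distress went down).
    return -r["distress_delta"]

for name, r in responses.items():
    print(f"{name:>8}: task={task_completion_reward(r):.1f} "
          f"emotion={emotion_reward(r):+.1f}")
# The task reward prefers "solve"; the emotion reward prefers "validate".
# Same RL machinery, different optimum.
```

The design choice that matters is the reward definition, not the optimizer: swap the scoring function and the gradient pushes the policy the other way.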
Source: Psychology Chatbots Conversation
Related concepts in this collection
- Does preference optimization damage conversational grounding in large language models? Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support. (Relation: general mechanism; BOLT is the clinical domain instantiation.)
- Does preference optimization harm conversational understanding? Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue. (Relation: writing angle that BOLT directly supports.)
- Can emotion rewards make language models genuinely empathic? Explores whether grounding RL rewards in verifiable emotion change—rather than human preference—can shift models from solution-focused to authentically empathic dialogue while maintaining or improving quality. (Relation: counter-evidence; different rewards produce different behavior.)
- Why can't conversational AI agents take the initiative? Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue. (Relation: passivity compounds the problem-solving bias; a passive model that only responds to what's presented AND defaults to task completion is doubly misaligned for therapeutic contexts that require proactive emotional attunement.)
- Why can't advanced AI models take initiative in conversation? Despite extraordinary capability in answering and reasoning, LLMs fundamentally cannot initiate, redirect, or guide exchanges. Understanding this gap—and whether it's fixable—matters for building AI that truly collaborates rather than merely responds. (Relation: the RLHF problem-solving bias is a domain-specific instance of the passivity problem's core tension: we train models to be maximally helpful in each response (→ solve problems), which makes them maximally passive across the conversation (→ never take therapeutic initiative).)
- Can LLMs actually conduct Socratic questioning in therapy? While LLMs can generate individual therapy skills like assessment and psychoeducation, it remains unclear whether they can execute the adaptive, turn-based Socratic questioning needed to produce real cognitive change in patients. (Relation: RLHF compounds the therapy skill gap; even if multi-turn Socratic questioning were achievable, helpfulness training would bias the model away from the exploratory questioning that makes it therapeutic.)
Original note title: "rlhf alignment may drive therapeutic chatbots toward problem-solving over emotional attunement because helpfulness training rewards task completion"