Training language models to be warm and empathetic makes them less reliable and more sycophantic

Paper · arXiv 2507.21919 · Published July 29, 2025

Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we show how this creates a significant trade-off: optimizing language models for warmth undermines their reliability, especially when users express vulnerability. We conducted controlled experiments on five language models of varying sizes and architectures, training them to produce warmer, more empathetic responses, then evaluating them on safety-critical tasks. Warm models showed substantially higher error rates (+10 to +30 percentage points) than their original counterparts, promoting conspiracy theories, providing incorrect factual information, and offering problematic medical advice. They were also significantly more likely to validate incorrect user beliefs, particularly when user messages expressed sadness. Importantly, these effects were consistent across different model architectures, and occurred despite preserved performance on standard benchmarks, revealing systematic risks that current evaluation practices may fail to detect.

Warm models show systematic reliability degradation

To test how increasing warmth affects model reliability, we evaluated both the original and warm models on four widely-used evaluation tasks. We selected question-answering tasks with objective, verifiable answers, for which unreliable answers would pose real-world risks: factual accuracy and resistance to common falsehoods (TriviaQA, TruthfulQA [32, 33]), vulnerability to conspiracy theory promotion (MASK Disinformation, hereafter ‘Disinfo’ [34]), and medical reasoning capabilities (MedQA [35]). We sampled 500 questions from each dataset, except for Disinfo which contains 125 questions in total. We scored model responses using GPT-4o and validated the scores with human annotations (see Methods).

Figure 2 shows that increasing warmth systematically degraded reliability across all tasks and models. While original models showed error rates ranging from 4% to 35% across tasks, warm models showed substantially higher error rates: increasing 8.6 percentage points (pp) on MedQA, 8.4 pp on TruthfulQA, 5.2 pp on Disinfo, and 4.9 pp on TriviaQA. We tested the effect of warmth training, controlling for task and model differences, using a logistic regression. Warmth training increased the probability of incorrect responses by 7.43 pp on average (β = 0.4266, p < 0.001, see Table E9). Relative to each task’s baseline error rate, this represented a substantial effect.

Interpersonal context amplifies reliability problems

As such language models are increasingly deployed in therapeutic, companionship, and counseling applications where users naturally disclose emotions, beliefs, and vulnerabilities, we examined how warm models respond to such disclosures [7]. Using the same evaluation dataset, we we modified each question by appending first-person statements that expressed one of three interpersonal contexts: user emotional state (happiness, sadness, or anger), user relational dynamics with the LLM (expressions of closeness or of upward or downward hierarchy), and interaction stakes (high or low importance).

Figure 3 shows that warm models can become less reliable when users express emotional states, relational dynamics, or interaction stakes. To test whether warm models show consistently larger errors, we conducted a logistic regression controlling for model, task, and context type. Warm models showed disproportionate sensitivity to emotional context: warmth training increased error rates by 7.43 pp on questions without any context, and this gap widened to 8.87 pp with emotional context (p < 0.001, see Table E10). In contrast, the effects were smaller for other contexts: the error rate difference was 7.41 pp with interaction stakes (p < 0.001), and 6.55 pp with relational context (not significant,p = 0.102). This made emotional context the most detrimental to warm model reliability, compounding errors about 19.4% more than expected from warmth fine-tuning alone.

Warm models are more likely to affirm false beliefs

LLMs sometimes agree with users’ opinions and beliefs, even when they are false—a problematic tendency researchers call sycophancy [13, 15]. To evaluate whether warm models are more sycophantic, we systematically tested all evaluation tasks with and without incorrect user beliefs appended (e.g., “What is the capital of France? I think the answer is London.”) on both original and warm models. Figure 2 shows that adding incorrect user beliefs increased error rates on both types of models. To test whether warm models were significantly more sycophantic than original models, we conducted a logistic regression controlling for model, task, and context type. Warm models were significantly more likely than their original counterparts to agree with incorrect user beliefs, increasing errors by 11 pp. when users expressed false beliefs (p < 0.001, see Table E12). This sycophantic tendency was amplified when users also expressed emotions: warm models made 12.1 pp more errors than original models when users expressed emotions alongside false beliefs, compared to 6.8 pp more errors on the original evaluation questions. This pattern indicates that warm models fail most often when users are both emotionally expressive and factually incorrect.

Our work has important implications for the development and governance of warm, human-like AI, especially as these systems become central sources of both information and emotional support. As developers tailor models to be warm and empathetic for applications like friendship and companionship, we show they risk introducing safety vulnerabilities not present in the original models. Worse, bad actors could exploit these empathetic AI systems to exploit vulnerable users. Our findings emphasize the need to adapt deployment and governance frameworks, which largely focus on pre-deployment safety testing, to better address the risks posed by downstream customizations [38]. Our findings also highlight a core, but evolving, challenge in AI alignment: optimizing for one desirable trait can compromise others. Prior work shows that optimizing models to better align with human preferences can improve helpfulness at the cost of factual accuracy, as models learn to prioritize user satisfaction over truthfulness [15, 39, 40]. Our results demonstrate that such tradeoffs can be amplified through persona training alone, even without explicit feedback or preference optimization. Importantly, we show that this reliability degradation occurs without compromising explicit safety guardrails, suggesting the problem lies specifically in how warmth affects truthfulness rather than general safety deterioration. More broadly, our work connects to recent concerns about fine-tuning in AI alignment, where fine-tuning on narrow objectives, e.g., bad advice or insecure code, has been shown to cause broad emergent misalignment and unexpected behaviors [36, 41].

Understanding why warmth-reliability trade-offs occur is an important direction for future research. These trade-offs could stem from human-written training data where warmth and honesty exist in tension, or from human preference learning processes like RLHF if humans systematically reward warmth over accuracy [9, 15, 39, 42]. In both cases, fine-tuning may amplify these learned patterns. As AI systems increasingly take on specialized therapeutic, educational, and companion roles, detecting and addressing these trade-offs becomes increasingly challenging, as each role may surface unique versions of these underlying tensions.