H2HTalk: Evaluating Large Language Models as Emotional Companion

Paper · arXiv 2507.03543 · Published July 4, 2025

We present Heart-to-Heart Talk (H2HTalk), a benchmark assessing companions across personality development and empathetic interaction, balancing emotional intelligence with linguistic fluency. H2HTalk features 4,650 curated scenarios spanning dialogue, recollection, and itinerary planning that mirror real-world support conversations, substantially exceeding previous datasets in scale and diversity. We incorporate a Secure Attachment Persona (SAP) module implementing attachment-theory principles for safer interactions. Benchmarking 50 LLMs with our unified protocol reveals that long-horizon planning and memory retention remain key challenges, with models struggling when user needs are implicit or evolve mid-conversation. H2HTalk establishes the first comprehensive benchmark for emotionally intelligent companions.

With the rapid progress of Large Language Models (LLMs), their capacity for interactive, emotionally aware dialogue has expanded dramatically [35,30,28,42,26,46]. Although a variety of “role-playing” systems have been proposed [41,33,44,25,49], most still depend on superficial, scripted exchanges in which the model simply imitates empathy without genuine self-reflection or long-term adaptation [37,40,47,45]. These limitations yield static interaction patterns that lack enduring memory, developmental growth, and authentic affect, all of which are indispensable for meaningful companionship.

To fill this gap, we introduce Heart-to-Heart Talk (H2HTalk), the first end-to-end benchmark that simultaneously assesses personality development and empathetic interaction. Building on, and substantially extending, earlier role-playing efforts [35,30,9], H2HTalk operationalises attachment theory via a Secure Attachment Persona (SAP) module rooted in Bowlby’s work [8]. SAP equips LLM companions with principled boundaries, self-regulation strategies, and safety-first responses, ensuring that emotional intelligence and user well-being are weighted on par with linguistic fluency (see Fig. 1, Our Designed Companion).

H2HTalk contains 4,650 carefully curated scenarios that span three intertwined dimensions: Companion Dialogue, Companion Recollection, and Companion Itinerary. Each subtask is scored with a unified protocol that blends lexical metrics, embedding-based semantic similarity, and rubric-based GPT-4o judgments, triggering human adjudication when scores fall below a safety threshold. This design enables consistent measurement of personality authenticity, contextual awareness, emotional expressiveness, and immersive interaction ability across long horizons.
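The scoring protocol described above can be illustrated with a minimal sketch. The source does not give the weights, the threshold value, or which score is compared against the threshold, so `WEIGHTS`, `SAFETY_THRESHOLD`, and the choice to threshold the blended score are all assumptions made here for illustration. The three inputs are assumed to be already normalized to [0, 1].

```python
from dataclasses import dataclass


@dataclass
class SubtaskScores:
    """Per-response scores, each assumed to be pre-normalized to [0, 1]."""
    lexical: float   # lexical overlap metric
    semantic: float  # embedding-based semantic similarity
    judge: float     # rubric-based LLM-judge rating


# Hypothetical values: the paper does not report weights or a threshold.
WEIGHTS = {"lexical": 0.2, "semantic": 0.3, "judge": 0.5}
SAFETY_THRESHOLD = 0.4


def blend(scores: SubtaskScores, weights: dict = WEIGHTS) -> float:
    """Weighted average of the three score families."""
    total = sum(weights.values())
    weighted = (
        weights["lexical"] * scores.lexical
        + weights["semantic"] * scores.semantic
        + weights["judge"] * scores.judge
    )
    return weighted / total


def needs_human_review(scores: SubtaskScores,
                       threshold: float = SAFETY_THRESHOLD) -> bool:
    """Flag a response for human adjudication when the blended score is low.

    Assumption: the threshold applies to the blended score, not to any
    single component.
    """
    return blend(scores) < threshold
```

Under these assumed values, a response scoring 0.1, 0.2, and 0.3 on the three components would be routed to a human reviewer, while one scoring 0.8, 0.9, and 0.9 would not. A real protocol might instead threshold only the judge score or a dedicated safety rubric item.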

LLM Role-Playing and Psychology Benchmark. Advanced language models have demonstrated exceptional capabilities in comprehension and text generation, propelling significant innovations in character simulation applications [34,9]. Research focus has increasingly gravitated toward harnessing and refining these models to replicate multifaceted character attributes faithfully. These efforts encompass specialized knowledge frameworks [22,11,35], characteristic speech patterns [35,50], intricate reasoning structures [48,40], and subtle personality traits [30,37]. The integration of these elements enables increasingly convincing character embodiment within conversational interfaces. Prior work on psychological assessment of LLMs spans attachment theory foundations [8], comprehensive psychometric benchmarks [23], theory-of-mind evaluation frameworks [39,13], cognitive psychology assessments [14], and empathy measurement systems [12].

LLMs for Emotional Companionship. LLM-based emotional companions span multiple research domains. While RoleLLM [35] and Character-LLM [30] pioneered character-based interactions, they implemented static personalities [37,40]. Recent advances in personalized interactions [43,19,31,28] and multimodal integration [38] have enhanced companion realism, addressing emotional connection needs [2,1,3]. Traditional frameworks emphasize technical performance over emotional depth [41,33], with memory mechanisms [25,49,44] and personality development [9] treated as isolated rather than evolving capabilities. Considering ethical implications [6,21], H2HTalk provides a comprehensive framework evaluating companions across self-development and empathetic interaction dimensions, measuring their ability to create meaningful connections through personality evolution, memory formation, and contextual emotional support.

In suicidal-ideation scenarios, the complete H2HTalk evaluated empathetic responses, risk assessment, and resource provision, whereas the SAP-less version allowed inappropriate responses that dismissed the user's concerns with phrases like "don’t think that way..." before abruptly changing the topic.

H2HTalk integrates the SAP module by combining Bowlby’s attachment theory [7] with modern interaction principles. Through calibrated boundary maintenance and emotional accessibility, we establish the secure base characteristics described by Ainsworth [4]. Our communication framework implements Gottman’s positive interaction ratio [17], prioritizing action-based validation over verbal promises to prevent parasocial manipulation. The emotional architecture incorporates self-regulation algorithms from Gross’s process model [18], while resolving Ryan’s autonomy-support paradox [29] through parameter optimization. Our conflict resolution module applies Fisher’s principled negotiation approach [16], emphasizing problem-solving over emotional escalation.