LLM-based Conversational AI Therapist for Daily Functioning Screening and Psychotherapeutic Intervention via Everyday Smart Devices

Paper · arXiv 2403.10779 · Published March 16, 2024

Description automatically generated](file:////Users/adrianchan/Library/Group%20Containers/UBF8T346G9.Office/TemporaryItems/msohtmlclip/clip_image011.png)

When the user needs further attention during the conversation, CaiTI can provide conversational psychotherapeutic interventions, including cognitive behavioral therapy (CBT) and motivational interviewing (MI). Leveraging the datasets prepared by the licensed psychotherapists, we experiment and microbenchmark various LLMs’ performance in tasks along CaiTI’s conversation flow and discuss their strengths and weaknesses. With the psychotherapists, we implement CaiTI and conduct 14-day and 24-week studies. The study results, validated by therapists, demonstrate that CaiTI can converse with users naturally, accurately understand and interpret user responses, and provide psychotherapeutic interventions

Therapists often rely on assessments such as the Daily Living Activities–20 (DLA-20) and the Global Assessment of Functioning (GAF) to screen day-to-day functions and mental health status [27, 37, 39, 58]. Most existing research efforts focus on screening for physical and mental well-being, with few addressing psychotherapeutic interventions. Psychotherapy refers to a range of interventions based on psychological theories and principles to address emotional and behavioral issues that impact mental health [28]. [62] and [101] propose conversational systems that provide preliminary consolation. While conversational systems and evidence-based treatments like Motivational Interviewing (MI) [60], Cognitive Behavioral Therapy (CBT) [15], and Dialectical Behavior Therapy (DBT) [72] have been proposed, many lack personalization or user understanding [74, 76].

(i) provide comprehensive day-to-day functioning screenings and employ evidence-based psychotherapeutic interventions; (ii) facilitate natural conversation flow; (iii) ensure the quality of care by enabling the system to intelligently interpret user responses and, if necessary, guide the dialogue back toward the psychotherapeutic objectives when the user’s responses deviate; and (iv) the conversation format (using smartphones/smart speakers) should take into consideration individuals with visual impairments.

Primarily, the system must fit within the users’ lifestyles and habits, utilizing devices that users already own and prefer. It should facilitate communication through the user’s preferred modes—be it verbal or textual—while ensuring comprehensive screening and delivering effective psychotherapeutic interventions in a privacy-aware manner.

CaiTI screens the user along the 37 dimensions of day-to-day functioning proposed in [62] by conversing naturally with users with open-ended questions.

provides appropriate empathic validations and psychotherapies depending on the physical and mental status of the user.

To realize more intelligent and friendly human-device interaction, we leverage RL to personalize each user’s conversation experience during screening in an adaptive manner. CaiTI prioritizes the dimensions that concern psychotherapists more about each user based on his/her historical responses and brings up the dimensions in the order of priority during the conversation.

• We design the conversation architecture of CaiTI with the therapists, which effectively incorporates Motivational Interviewing (MI) and Cognitive Behavioral Therapy (CBT) – two commonly used psychotherapeutic interventions administered by psychotherapists – to provide Psychotherapeutic Conversational Intervention in a natural way that closely mirrors the therapists’ actual practices.

CBT is found to be effective in a variety of diagnoses, such as mood disorders, Attention-deficit/hyperactivity disorder (ADHD), eating disorders, Obsessive-compulsive disorder, and Post-traumatic stress disorder

when the therapist asks a question, some clients express a lot, while others do not respond to the question, but talk about other things (related to other dimensions). In addition, not all clients are patient enough to go through all dimensions that the therapist wants to check. Psychotherapists usually start to check on the dimensions that the clients didn’t do well in previous sessions and are more important for assessment. If clients have a problem in a dimension, the therapists usually follow up to hear more about this dimension and provide quick counseling and therapy addressing the specific issue. This mirrors the psychotherapist’s tendency to focus on one problematic dimension extensively rather than treating multiple dimensions at once

CaiTI asks one question for each dimension if CaiTI does not obtain any information in the dimension from the user’s previous responses. A model-free reinforcement learning algorithm, Q-learning, is used to decide the action (i.e., the next question) in the current state (i.e., the current question). For each dimension (Dimension_N), CaiTI Questioner formulates the question and uses the text-to-speech method to converse with the user through the front-end device.

After CaiTI enumerates all dimensions or the user wants to stop the session, CaiTI provides a summary of the chat session and asks the user to choose a dimension to work on for the CBT process. This CBT process includes the four steps outlined in Section 2.2. In particular, CaiTI identifies the situation and issue in the dimension the user chose based on the conversation history. Then, CaiTI leads the user to recognize (CBT Stage_1), challenge (CBT Stage_2), and reframe (CBT Stage_3) the negative thoughts in this situation. To ensure the effectiveness and quality of the CBT process, each CBT stage contains a Reasoner and a Guide (see Section 5.4).

For example, one output generated by Llama-2-13b R-V Validator engaged in questions aiming to problem solve based on the user’s responses, rather than offering empathetic validation as intended, which deviates significantly from the expected function of providing empathetic support.

Therapists comment that “GPT-4 sometimes sounds like it is reading into the user’s feelings” instead of guiding the user objectively

Moreover, GPT-based models sometimes add their own interpretation of users’ feelings instead of providing an objective, matter-of-fact output based on the user responses.

To prevent the propagation of flaws or biases in LLMs, which may lead to ineffective or potentially harmful psychotherapy intervention, instead of leveraging models to handle all tasks during the psychotherapy process, CaiTI divides the tasks and employs different models to specifically handle each subtask.

we predominantly use few-shot prompting the system content in the chat completion in these LLMs to achieve the desired functions. Each prompt outlines: (i) the objectives; (ii) the information to be included in user content; and (iii) the desired goal and response format. The response format for Reasoners will be “Decision: 0/1”, while it is “Analysis: XXX” for Guides and Validator. For Reasoners, Guides, and Validator, the prompt includes 3-4 examples encompassing user content alongside corresponding system responses that adhere to the specified format

Q-learning agent has 39 states (37 questions, start, and end).

Microbenchmark – Response Analyzer. To the best of our knowledge, no dataset exists with responses to these questions in the 37 dimensions. Therefore, psychotherapists create a dataset, which includes: (i) 6,950 user responses sample with the (Dimension, Score) labeled by the therapists, and (ii) 300 5-class general responses to express Yes, No, Maybe, Question, and Stop. Note that one user response may have one or more (Dimension, Score). As such, there are 7,000 (Dimension, Score) for the 6,950 responses. The number of responses per dimension is 103 to 177.

The datasets are split into 90% and 10% for training and testing set to fine-tune and evaluate the GPT

The follow-up question would start with the simple reflection in MI, a technique where the psychotherapist or counselor mirrors what the client has said. A GPT-4-based ReflectiveSummarizer is prompted to provide the simple reflection, which essentially rephrases or repeats the client’s own words, altering any self-references from the first person to the third person [66].

CaiTI incorporates a R-V Reasoner to determine whether the follow-up response is related to the original response or the question asked in the current state. As illustrated in Scenario 1 in Figure 8, CaiTI will offer empathic validation if the user provides a valid follow-up response. Otherwise, the R-V Guide will assist the user in providing a follow-up response that more accurately describes the situation at hand before proceeding to empathic validation

With a valid follow-up response, a R-V Validator is used to provide empathic validation and support to the user, which incorporates the affective reflection and affirmation techniques in MI.

the CBT process usually includes four steps and CaiTI completes the first step – identifying the situation and issues – for the user based on the historical user responses in the current conversation session. As such, there are three stages remaining: recognizing the negative thoughts (CBT_Stage1), challenging the negative thoughts (CBT_Stage2), and reframing the thoughts and the situations (CBT_Stage3).

Therapists also point out that an acceptable response involves identifications of cognitive distortion, such as polarized thinking, overgeneralization, emotional reasoning, catastrophizing, and jumping to conclusions. The Reasoner is tasked with recognizing responses containing cognitive distortions as valid, especially for CBT_Stage1 Reasoner.

Llama-based models had a hard time following the instructions in the few-shot prompts when the expressions from the user lacked logical consistency and with cognitive distortions.

Moreover, GPT-based models sometimes add their own interpretation of users’ feelings instead of providing an objective, matter-of-fact output based on the user responses. Llama-based models with few-shot prompts have more stable performance for CBT_Stage2 Guide and CBT_Stage3 Guide, where the user responses are more standard and controlled thanks to the filtering of CBT Reasoners and the tasks, challenging and reframing the negative thoughts, are more straight forward.