Development and validation of large language model rating scales for automatically transcribed psychological therapy sessions
Rating scales have shaped psychological research, but they are resource-intensive and can burden participants. Large Language Models (LLMs) offer a tool for assessing latent constructs in text. This study introduces LLM rating scales, which use LLM responses in place of human ratings. We demonstrate the approach with an LLM rating scale measuring patient engagement in therapy transcripts. Automatically transcribed videos of 1,131 sessions from 155 patients were analyzed using DISCOVER, a software framework for local multimodal human behavior analysis. The Llama 3.1 8B LLM rated 120 engagement items, and the top eight items were averaged into a total score. Psychometric evaluation showed a normal distribution, strong reliability (ω = 0.953), and acceptable fit on most indices (CFI = 0.968, SRMR = 0.022), with the exception of RMSEA (0.108). Validity was supported by significant correlations with engagement determinants (e.g., motivation, r = .413), processes (e.g., between-session efforts, r = .390), and outcomes (e.g., symptoms, r = −.304). Results remained robust across bootstrap resampling and cross-validation procedures that accounted for the nested data structure. The LLM rating scale exhibited strong psychometric properties, demonstrating the potential of the approach as an assessment tool. Importantly, this automated approach uses interpretable items, ensuring a clear understanding of the measured construct, while supporting local implementation and protecting confidential data.
Rating scales and other measurement instruments have played a central role in assessing psychological constructs, enabling researchers and practitioners to quantify behaviors, emotions, and therapeutic processes. In clinical psychology, these tools are essential for tracking patient progress, evaluating treatment efficacy, and ensuring evidence-based care1. Continuous measurement throughout treatment allows practitioners to refine their interventions and align them with patients’ evolving needs. Traditional rating scales, such as self-report and observer-based instruments, have enabled significant advances in psychological assessment. Approaches such as routine outcome monitoring (ROM)2, measurement-based care (MBC)3, and feedback-informed therapy (FIT)4 exemplify the practical achievements made possible by these measures. By systematically incorporating measurement into therapy, these approaches have demonstrated improved symptom reduction, lower dropout rates, and better outcomes for not-on-track cases5.
Despite their benefits, traditional methods are not without limitations. Self-report instruments are prone to response biases such as social desirability and recall effects, which can compromise the validity of the results6. Observer-based ratings, while valuable, require significant resources, including intensive training, careful rater selection, and the time-consuming process of conducting and reviewing ratings to ensure reliability7. Additionally, the response burden associated with frequent assessments can hinder patients’ willingness to participate, limiting the granularity of data collected8.
Recent advancements in Natural Language Processing (NLP) and Large Language Models (LLMs) present promising possibilities for addressing some of the limitations of traditional measures9. NLP technologies have made significant advances in text analysis, enabling researchers to extract nuanced information from large amounts of data. Given that psychological therapies are predominantly conversational and language-based, these tools may offer valuable new ways to study therapeutic processes and outcomes. For example, NLP has been used to analyze therapy session transcripts to predict patient distress10, study emotional coherence11, and measure emotional tone12,13. Furthermore, topic modeling, another NLP technique, has been applied to assess therapeutic alliance and symptom severity14, while machine learning models incorporating NLP have been employed to evaluate multicultural orientation in therapy15.
Beyond text-based applications, video analysis has emerged as another powerful tool for automated measurement in clinical psychology. Deep learning methods have been used to assess non-verbal emotional expressions in psychological therapies, capturing aspects of the therapeutic interaction that are difficult to measure using traditional methods16. Notably, advancements in automated transcription allow audio data from audio-visual recordings to be converted into text, enabling the seamless integration of video analysis and NLP approaches. These developments underscore the potential of combining modalities to gain a more comprehensive understanding of therapeutic processes17.
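To make this step concrete, the minimal sketch below transcribes a session recording locally with the open-source whisper package. This is an illustrative assumption rather than the study's own tooling (which used the DISCOVER framework), and the file names are hypothetical.

```python
# Minimal sketch: local automatic transcription of an audio-visual recording.
# Uses the open-source `whisper` package (pip install openai-whisper) as an
# illustrative stand-in for the DISCOVER pipeline; file names are hypothetical.
import whisper

model = whisper.load_model("base")                  # runs fully on-device
result = model.transcribe("session_recording.mp4")  # ffmpeg extracts the audio

with open("session_transcript.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])
```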
With the growing use of NLP and LLMs as measurement instruments in clinical psychology, it is becoming increasingly important to apply established psychometric principles to ensure the objectivity, reliability, and validity of such automated measures. To address this need, our study applies classical test theory and scale construction principles to develop and evaluate automated measures based on LLMs. Specifically, we propose an LLM rating scale approach. An LLM rating scale is a psychometric tool for measuring latent constructs through the analysis of text data. It mirrors traditional rating scales in using a structured set of items, assigning numerical values to responses, and undergoing psychometric evaluation for reliability and validity. However, instead of human ratings, it uses LLM-generated responses derived from prompts combined with text inputs such as therapy transcripts, session documentation, or other case records.
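As an illustration of this idea, the sketch below poses a single engagement item to a locally hosted instruction-tuned model via the Hugging Face transformers library and parses its numeric response. The item wording, scale anchors, and file name are hypothetical; the actual LLEAP prompts and items are developed later in this paper.

```python
# Minimal sketch of a single LLM rating-scale item, assuming a locally hosted
# instruction-tuned model served via the Hugging Face transformers library.
# The item wording, scale anchors, and file name below are illustrative
# assumptions, not the actual LLEAP items developed in this study.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def rate_item(transcript: str, item: str) -> int:
    """Ask the model to answer one item with a single integer on a 1-5 scale."""
    prompt = (
        "You are rating a psychotherapy session transcript.\n"
        f"Item: {item}\n"
        "Answer with a single integer from 1 (not at all) to 5 (very much).\n\n"
        f"Transcript:\n{transcript}\n\nRating:"
    )
    out = generator(prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"]
    completion = out[len(prompt):]                   # strip the echoed prompt
    digits = [c for c in completion if c.isdigit()]  # keep the first digit found
    return int(digits[0]) if digits else 0           # 0 flags an unparsable reply

transcript = open("session_transcript.txt", encoding="utf-8").read()
score = rate_item(transcript, "The patient actively participates in the session.")
```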
To test the utility of this approach, the study focuses on the construct of patient engagement, a concept critical to the success of psychological therapies. Engagement is a multifaceted construct encompassing both motivational and relational aspects of the therapeutic process18. It reflects the extent to which patients are invested in and connected to therapy, including their willingness to actively participate, their relational bond with the therapist, and their alignment with therapeutic goals. Holdsworth et al.’s Model of Client Engagement in Psychotherapy18 provides a robust theoretical framework for conceptualizing engagement. This model differentiates between engagement determinants (e.g., client motivation, therapeutic relationship), processes (e.g., attendance, within- and between-session efforts), and outcomes (e.g., treatment success). Building on this foundation, this study develops and evaluates the Large Language Model Engagement Assessment in Psychological Therapies (LLEAP), an LLM rating scale designed to automatically measure engagement by analyzing therapy session transcripts.
In addition, the study addresses practical challenges in clinical psychological research. It explores the automation of transcription processes to reduce resource demands and demonstrates how LLMs can be implemented locally to ensure the confidentiality of sensitive patient data, avoiding the privacy concerns associated with cloud-based LLM solutions (e.g., ChatGPT). While the primary focus is on psychological therapies, the methodology has broader applications in psychological research, particularly in areas reliant on conversational or text-based data.
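One way such a local setup could be enforced is sketched below, assuming the model weights were downloaded once in advance. The offline flag is a standard Hugging Face mechanism that blocks further network access; the study's actual configuration (via DISCOVER) may differ.

```python
# Minimal sketch of a fully local LLM setup, so transcripts never leave the
# machine. Assumes the model weights were downloaded once beforehand; the
# offline flag below makes the library fail rather than contact the hub.
import os
os.environ["HF_HUB_OFFLINE"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```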
The objectives of the study are threefold: First, it aims to create a semi-automated pipeline that integrates transcription, item generation, and item selection. Second, the study seeks to develop an LLM rating scale (i.e., LLEAP) designed to automate the measurement of patient engagement in psychological therapies. Third, the psychometric properties of this LLM-based scale will be evaluated, focusing on reliability, model fit, and validity. In line with these objectives, we formulated specific hypotheses. We anticipate that the LLM rating scale will demonstrate acceptable reliability (H1). Furthermore, we expect the scale to exhibit acceptable model fit (H2). Finally, we predict that the LLM-based measure of patient engagement will demonstrate validity through significant correlations with key determinants (i.e., motivation, alliance), processes (i.e., between- and within-session effort), and outcomes (i.e., symptom outcome) of engagement (H3).
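As a concrete illustration of the reliability evaluation in H1, the sketch below computes McDonald's omega from a one-factor model of the item ratings. The factor_analyzer package and variable names are assumptions for illustration, not necessarily the software used in the study's own analyses.

```python
# Minimal sketch of the reliability check in H1: McDonald's omega from a
# one-factor model. The `factor_analyzer` package is an assumed tool choice.
import numpy as np
from factor_analyzer import FactorAnalyzer

def mcdonalds_omega(item_scores: np.ndarray) -> float:
    """item_scores: (n_sessions x n_items) matrix of LLM item ratings."""
    fa = FactorAnalyzer(n_factors=1, rotation=None)
    fa.fit(item_scores)                   # fits on the item correlation matrix
    loadings = fa.loadings_.ravel()       # standardized factor loadings
    uniquenesses = fa.get_uniquenesses()  # residual (unique) variances
    common = loadings.sum() ** 2
    # omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses)
    return common / (common + uniquenesses.sum())
```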