Can local language models rate therapy engagement reliably?
Explores whether using a local LLM to generate engagement ratings produces psychometrically sound measurements comparable to traditional human-rated scales, while preserving data privacy.
LLEAP (Large Language Model Engagement Assessment in Psychological Therapies) introduces a methodological shift: instead of using LLMs to directly assess a construct, it uses LLM responses as items in a psychometric rating scale — mirroring traditional scale construction but replacing human raters with a local Llama 3.1 8B model. Applied to automatically transcribed videos of 1,131 sessions from 155 patients, the approach shows strong psychometric properties: reliability omega = 0.953, acceptable model fit (CFI = 0.968, SRMR = 0.022), and significant correlations with engagement determinants (motivation r = .413, alliance), processes (between-session effort r = .390), and outcomes (symptom reduction r = -.304).
The methodological contribution is the bridge between NLP and classical psychometrics. Rather than treating LLM outputs as direct measurements (where validity is opaque), the approach subjects LLM-generated ratings to the same psychometric evaluation framework — item analysis, factor structure, reliability, convergent and discriminant validity — that would be applied to any new rating scale. The 120-item pool is reduced to the top 8 items for the final scale, following standard scale construction principles.
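The scale-construction step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: items are ranked by corrected item-total correlation, and omega total is approximated by treating first-principal-component loadings as one-factor loadings; the simulated data (one latent factor, 1,131 sessions by 120 items) are purely illustrative.

```python
import numpy as np

def select_top_items(scores, k=8):
    """Rank items by corrected item-total correlation and keep the top k.

    scores: (n_sessions, n_items) matrix of per-item LLM ratings.
    """
    n_items = scores.shape[1]
    total = scores.sum(axis=1)
    # Correlate each item with the total of the *remaining* items,
    # so an item is not correlated with itself.
    r = np.array([
        np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
        for j in range(n_items)
    ])
    return np.argsort(r)[::-1][:k]

def omega_total(scores):
    """Approximate McDonald's omega, using first-principal-component
    loadings of the item correlation matrix as stand-ins for
    one-factor loadings (a rough substitute for a fitted CFA)."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)       # ascending order
    loadings = np.abs(eigvecs[:, -1] * np.sqrt(eigvals[-1]))
    uniq = 1.0 - loadings**2                      # item uniquenesses
    return loadings.sum()**2 / (loadings.sum()**2 + uniq.sum())

# Illustrative data: one latent engagement factor driving 120 items.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1131, 1))
load = rng.uniform(0.2, 0.8, size=(1, 120))
items = latent @ load + rng.normal(scale=0.7, size=(1131, 120))

keep = select_top_items(items, k=8)
omega = omega_total(items[:, keep])
```

Selecting the highest-loading items before computing reliability is why a short 8-item scale can still reach a high omega.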
Two practical advantages stand out. First, local implementation: running Llama 3.1 8B locally ensures that confidential therapy session data never leaves the institution, addressing the privacy barrier that blocks clinical use of cloud-based LLMs. Second, interpretability: because the scale uses discrete, human-readable items rather than opaque embeddings, clinicians can see exactly what is being measured. Building on "Can we measure therapist-patient alliance from dialogue turns in real time?", LLEAP extends the automated measurement toolkit from alliance to engagement, and its psychometric validation framework provides a template that could be applied to any construct measurable from transcripts.
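The local-rating loop is conceptually simple: each scale item becomes one prompt against the transcript, answered on a fixed numeric range. The sketch below is hypothetical throughout: the item wording, the `http://localhost:8080/v1/completions` endpoint, and the OpenAI-style payload are assumptions (any locally hosted Llama server with a compatible API would do), not the paper's actual setup.

```python
import json
import re
from urllib import request

# Hypothetical item wording; the paper's real items may differ.
ITEM = "Did the patient actively contribute ideas during the session?"

def build_prompt(item: str, transcript: str) -> str:
    """Turn one scale item plus a transcript into a rating prompt."""
    return (
        "You are rating a therapy transcript on a single item.\n"
        f"Item: {item}\n"
        "Answer with a single integer from 1 (not at all) to 5 (very much).\n\n"
        f"Transcript:\n{transcript}"
    )

def parse_rating(text: str):
    """Extract the first digit in 1-5 from the model's reply, else None."""
    m = re.search(r"[1-5]", text)
    return int(m.group()) if m else None

def rate_locally(item, transcript, url="http://localhost:8080/v1/completions"):
    """Send one item to a locally hosted Llama server, so the
    transcript never leaves the machine. Endpoint shape is assumed."""
    body = json.dumps({
        "prompt": build_prompt(item, transcript),
        "max_tokens": 4,
        "temperature": 0,
    }).encode()
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        out = json.load(resp)
    return parse_rating(out["choices"][0]["text"])
```

Because each item is a separate, human-readable question with a constrained numeric answer, the resulting ratings can be audited item by item, which is what makes the downstream psychometric analysis possible.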
The approach also addresses a key limitation of traditional measurement: response burden. Self-report instruments require patient participation and are prone to social desirability bias; observer-based ratings require intensive rater training and time. Automated transcript analysis eliminates both burdens while maintaining measurement rigor. Building on "Do therapists accurately perceive the working alliance with patients?", automated measurement from transcripts, rather than from self-report, may capture engagement dynamics that neither therapists nor patients accurately report.
Source: Psychology Therapy Practice
Related concepts in this collection
- Can we measure therapist-patient alliance from dialogue turns in real time?
  Explores whether computational methods can detect working alliance quality at turn-level resolution during therapy sessions, enabling immediate feedback on whether the therapeutic relationship is strengthening.
  Relation: COMPASS measures alliance, LLEAP measures engagement; both work from transcripts, but LLEAP adds psychometric validation.
- Do therapists accurately perceive the working alliance with patients?
  This research explores whether therapists' own assessments of the therapeutic relationship match what patients actually experience, especially in high-risk cases like suicidality.
  Relation: Automated measurement bypasses the self-report and therapist-report biases that distort alliance data.
- Can AI generate assessment questions as good as human experts?
  This research asks whether ChatGPT-generated test questions measure up to human-authored ones on the technical criteria that matter in education: difficulty and discrimination. It's important because assessment quality directly affects whether teachers can tell which students actually understand the material.
  Relation: LLMs generating assessment items versus LLMs serving as raters within a psychometric framework; complementary approaches to LLM-based measurement.
- Can reinforcement learning optimize therapy dialogue in real time?
  Can RL systems trained on working alliance scores recommend therapy topics that improve clinical outcomes during live sessions? This explores whether validated clinical constructs can serve as reward signals for dialogue optimization.
  Relation: Engagement measurement could serve as an additional signal for AI supervisor systems.
Original note title
LLM-generated rating scales for therapy transcripts achieve strong psychometric properties — enabling automated patient engagement measurement without human raters or cloud data exposure