Psychology and Social Cognition

Can local language models rate therapy engagement reliably?

Explores whether using a local LLM to generate engagement ratings produces psychometrically sound measurements comparable to traditional human-rated scales, while preserving data privacy.

Note · 2026-04-18 · sourced from Psychology Therapy Practice

LLEAP (Large Language Model Engagement Assessment in Psychological Therapies) introduces a methodological shift: instead of asking an LLM to assess a construct directly, it treats LLM responses as items in a psychometric rating scale — mirroring traditional scale construction but replacing human raters with a local Llama 3.1 8B model. Applied to automatic transcripts of 1,131 video-recorded sessions from 155 patients, the approach shows strong psychometric properties: reliability omega = 0.953, acceptable model fit (CFI = 0.968, SRMR = 0.022), and significant correlations with engagement determinants (motivation r = .413, alliance), processes (between-session effort r = .390), and outcomes (symptom reduction r = -.304).
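The item-as-prompt design can be sketched in a few lines. In this minimal, hypothetical illustration, `ask_local_llm` stands in for a real local-inference call to a model such as Llama 3.1 8B, and the example item wordings are invented; the point is only the shape of the pipeline: one yes/no item prompt per column, one session transcript per row, and a summative scale score over the items.

```python
# Sketch of an LLM-as-rater scale: each item is a question the local model
# answers about a session transcript; the 0/1 answers become item scores.

ITEMS = [  # invented example items, not LLEAP's actual item pool
    "Does the patient elaborate on their own experiences unprompted?",
    "Does the patient report completing between-session tasks?",
    "Does the patient ask questions about the therapy process?",
]

def ask_local_llm(transcript: str, item: str) -> int:
    """Placeholder rater: a real system would prompt a locally hosted
    model and parse a yes/no answer. Here a trivial keyword check
    keeps the sketch runnable without any model."""
    keywords = {"elaborate": "i felt", "between-session": "homework",
                "questions": "why"}
    for key, needle in keywords.items():
        if key in item.lower():
            return int(needle in transcript.lower())
    return 0

def rate_session(transcript: str) -> list[int]:
    """One row of the rating matrix: one 0/1 score per item."""
    return [ask_local_llm(transcript, item) for item in ITEMS]

def scale_score(row: list[int]) -> float:
    """Summative scale score, as in classical test theory."""
    return sum(row) / len(row)

transcript = "I felt anxious this week, and I did my homework. Why does that happen?"
row = rate_session(transcript)
print(row, scale_score(row))  # → [1, 1, 1] 1.0
```

In a real system the placeholder would be a prompt to the local model, and the resulting item-score matrix would then feed the psychometric analyses (item analysis, factor structure, reliability) rather than being reported directly.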

The methodological contribution is the bridge between NLP and classical psychometrics. Rather than treating LLM outputs as direct measurements (where validity is opaque), the approach subjects LLM-generated ratings to the same psychometric evaluation framework — item analysis, factor structure, reliability, convergent and discriminant validity — that would be applied to any new rating scale. The 120-item pool is reduced to the top 8 items for the final scale, following standard scale construction principles.
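Two of those scale-construction steps can be made concrete in a short sketch. This is illustrative only, assuming corrected item-total correlations for item selection and a one-factor model for reliability; the paper's exact procedure may differ.

```python
# Sketch of two standard scale-construction steps: (1) reduce an item pool
# to the top-k items by corrected item-total correlation, and (2) estimate
# reliability with McDonald's omega from one-factor loadings.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def top_k_items(matrix, k):
    """matrix: rows = sessions, columns = items. Rank items by corrected
    item-total correlation (item vs. total score excluding that item)."""
    n_items = len(matrix[0])
    scores = []
    for j in range(n_items):
        item = [row[j] for row in matrix]
        rest = [sum(row) - row[j] for row in matrix]
        scores.append((pearson(item, rest), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def mcdonalds_omega(loadings, uniquenesses):
    """omega = (sum lambda)^2 / ((sum lambda)^2 + sum theta)
    for a one-factor model with loadings lambda, unique variances theta."""
    num = sum(loadings) ** 2
    return num / (num + sum(uniquenesses))

# Eight items each loading .7 on one factor give omega of about .885;
# the reported omega = .953 implies substantially higher loadings.
omega = mcdonalds_omega([0.7] * 8, [1 - 0.7 ** 2] * 8)
print(round(omega, 3))  # → 0.885
```

The same two functions make the note's point visible: a 120-column item matrix goes in, eight columns come out, and omega quantifies how internally consistent the surviving items are.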

Two practical advantages stand out. First, local implementation: running Llama 3.1 8B locally ensures that confidential therapy session data never leaves the institution — addressing the privacy barrier that blocks clinical use of cloud-based LLMs. Second, interpretability: because the scale uses discrete, human-readable items rather than opaque embeddings, clinicians can see exactly what is being measured. Together with the earlier note "Can we measure therapist-patient alliance from dialogue turns in real time?", LLEAP extends the automated measurement toolkit from alliance to engagement — and its psychometric validation framework provides a template that could be applied to any construct measurable from transcripts.

The approach also addresses a key limitation of traditional measurement: response burden. Self-report instruments require patient participation and are prone to social desirability bias; observer-based ratings require intensive rater training and time. Automated transcript analysis eliminates both burdens while maintaining measurement rigor. And as the note "Do therapists accurately perceive the working alliance with patients?" suggests, automated measurement from transcripts — rather than from self-report — may capture engagement dynamics that neither therapists nor patients accurately report.



LLM-generated rating scales for therapy transcripts achieve strong psychometric properties — enabling automated patient engagement measurement without human raters or cloud data exposure