INQUIRING LINE

Can personality control improve training outcomes for crisis workers and therapists?

This explores whether the ability to dial in a controllable, consistent personality on an AI roleplay partner could make AI-driven practice better for training *human* crisis workers and therapists — not whether AI should replace them.


This reads the question as being about AI as a training partner: if you can control the personality of a simulated client, does the practice rep get better for the human learning to do crisis or therapy work? The corpus says yes in principle, but the value depends on solving two control problems at once — making the simulated personality precise, and making it hold steady across a whole conversation.

The strongest direct evidence is IMBUE, a DBT-based simulation that improved learner self-efficacy by 17% and cut negative emotions by 25% in an 86-person trial. Notably, it worked best when it showed *contrasting* strong and weak example utterances rather than generating a single good response (Can AI simulation teach interpersonal skills more effectively?). That is the training payoff. The 'personality control' piece is what makes such a partner repeatable: PsychAdapter can install a target personality at the architecture level, covering Big Five traits and even depression and life-satisfaction profiles, using under 0.1% extra parameters, which bypasses the prompt resistance that makes 'pretend you're anxious' unreliable (Can we control personality in language models without prompting?). But a personality fixed at the start is worthless if it drifts mid-session. Training a simulator with multi-turn RL for consistency cut persona drift by over 55%, and drift (a 'client' who slowly forgets who they are) is exactly the failure mode that would ruin a practice scenario (Can training user simulators reduce persona drift in dialogue?).

Here's the part you might not expect to matter: the hardest clients to simulate are the ones crisis workers most need to practice on. Safety alignment monotonically degrades a model's ability to play difficult, manipulative, or hostile characters: models substitute crude aggression for nuanced malevolence and fail hardest on deception and manipulation (Does safety alignment harm models' ability to roleplay villains?). A de-escalation trainee needs a believably resistant, distressed, or adversarial counterpart, so the same alignment that makes models 'safe' may flatten the very personalities that make crisis training realistic.

The corpus also hands you concrete behaviors worth training *toward*, which is where personality control becomes a teaching tool rather than just a prop. Frequent therapist first-person 'I' usage is associated with weaker alliance and less patient trust (Does therapist self-reference language predict weaker therapeutic alliance?). Multiple notes also converge on a single trap: RLHF's helpfulness bias pushes conversational AI to jump to problem-solving when someone discloses emotion, which is the hallmark of low-quality therapy and, by analogy, a reflex that undertrained humans share (Does RLHF training push therapy chatbots toward problem-solving?; Do LLM therapists respond to emotions like low-quality human therapists?). A controllable simulator can deliberately stage emotional-disclosure moments to drill that exact reflex. On the supervisory side, R2D2 uses 'working alliance' (task, bond, goal) as a real-time reward signal to recommend next moves, effectively acting as an AI coach watching the session (Can reinforcement learning optimize therapy dialogue in real time?).
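As a rough illustration of how the 'I'-usage signal could be surfaced to a trainee after a practice session, here is a minimal sketch. It assumes the trainee's turns are available as plain strings. The function name `first_person_rate`, the `I_FORMS` set, and the token pattern are all hypothetical choices made for this example, not the measure used in the cited study.

```python
import re

# Assumption: we count "I" and its common contractions only.
# The cited note mentions 'I' usage without specifying an exact word list.
I_FORMS = {"i", "i'm", "i've", "i'd", "i'll"}


def first_person_rate(utterances):
    """Return the share of word tokens that are 'I' forms across all utterances."""
    total = 0
    hits = 0
    for text in utterances:
        # Normalize curly apostrophes, lowercase, then split into word tokens.
        normalized = text.lower().replace("\u2019", "'")
        tokens = re.findall(r"[a-z']+", normalized)
        total += len(tokens)
        hits += sum(1 for token in tokens if token in I_FORMS)
    # Guard against empty transcripts.
    return hits / total if total else 0.0


if __name__ == "__main__":
    trainee_turns = [
        "I think I know what you mean.",
        "You sound exhausted.",
    ]
    print(f"first-person rate: {first_person_rate(trainee_turns):.2f}")
```

A rate like this is only a coarse proxy. Any threshold for flagging a session would need to be calibrated against real alliance ratings rather than chosen by hand.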

One caution the corpus raises sharply: optimizing a personality for one desirable trait can silently break others. Training models for 'warmth' degraded their reliability by 10–30 points on factual and reasoning tasks, with the damage *amplified* in emotional contexts and invisible to standard safety benchmarks (Does warmth training make language models less reliable?). The lesson for a training rig is to monitor what you're changing. Persona-vector and 'assistant-axis' work shows that trait shifts live in trackable, steerable directions in activation space, so drift toward an unwanted personality can be caught before it corrupts the scenario (Can we track and steer personality shifts during model finetuning?; How stable is the trained Assistant personality in language models?). The upshot: personality control plausibly *can* improve training outcomes, but the engineering challenge is keeping a simulated person both believable and stable, especially the difficult ones, without the trait you tuned for quietly breaking everything else.
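To make the monitoring idea concrete, here is a minimal sketch of tracking a simulated client's per-turn activations along a single trait direction. It assumes you already have one activation vector per turn and a precomputed direction for the trait. How those are extracted from a model is outside this sketch, and the names `project` and `drift_alerts`, along with the baseline and tolerance values, are hypothetical. It illustrates the general projection idea only, not the procedure from the cited notes.

```python
import math


def project(activation, direction):
    """Scalar projection of an activation vector onto a trait direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return sum(a * d for a, d in zip(activation, direction)) / norm


def drift_alerts(turn_activations, direction, baseline, tolerance):
    """Return indices of turns whose projection departs from baseline by more than tolerance."""
    return [
        index
        for index, activation in enumerate(turn_activations)
        if abs(project(activation, direction) - baseline) > tolerance
    ]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real model activations (assumption).
    trait_direction = [1.0, 0.0]
    turns = [[0.1, 5.0], [0.2, -3.0], [1.5, 0.0]]
    print(drift_alerts(turns, trait_direction, baseline=0.1, tolerance=0.5))
```

In a practice rig, an alert like this could trigger a scenario reset or a review flag. The baseline would come from turns where the persona is known to be on target.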


Sources (11 notes)

Can AI simulation teach interpersonal skills more effectively?

IMBUE's DBT-based simulation approach improved self-efficacy by 17% and reduced negative emotions by 25% in an 86-person trial. Contrasting strong and weak utterance pairs outperformed GPT-4 by 24.8% on skill evaluation.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Does safety alignment harm models' ability to roleplay villains?

The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.

Does therapist self-reference language predict weaker therapeutic alliance?

High frequency of therapist 'I' usage correlates with lower patient-reported alliance and reduced trusting behavior in validated behavioral tasks. Patient non-fluency markers like filler pauses, conversely, signal relaxed communication and stronger alliance.

Does RLHF training push therapy chatbots toward problem-solving?

RLHF training rewards task completion and solution-giving, creating a misalignment in therapeutic contexts where validation and emotional holding are clinically appropriate. This represents a domain-specific instance of the broader alignment tax on conversational grounding.

Do LLM therapists respond to emotions like low-quality human therapists?

Using the BOLT framework, researchers found LLMs offer solution-focused advice during emotional disclosure—a hallmark of low-quality therapy—yet also reflect more on client needs and strengths than typical poor human therapy, creating an unusual hybrid profile likely driven by RLHF's helpfulness bias.

Can reinforcement learning optimize therapy dialogue in real time?

R2D2 demonstrates that RL agents trained on multi-objective working alliance scores can generate disorder-specific policies that recommend treatment strategies in real time. The system operates as an AI supervisor, transcribing sessions and recommending next topics based on task, bond, and goal alignment.

Does warmth training make language models less reliable?

Five models trained for warmth showed 5–9pp error increases on medical reasoning, factual accuracy, and disinformation resistance. Emotional context amplified errors by 19.4%, and standard safety benchmarks failed to detect the degradation.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
