What defenses exist against personality-based psychological targeting at scale?

This reads the question as: given that AI can now tailor persuasion to individual personalities cheaply and at scale, what in the corpus actually pushes back — at the model architecture, the platform, or the human level?

The first honest thing the corpus shows about defenses against personality-based psychological targeting is an asymmetry: it has far more on how the attack works than on how to stop it. The threat is now cheap. Generative AI removes the human writer from personality-tailored political ads, turning persuasion from a writer-time problem into a compute problem (Can generative AI scale personality-targeted political persuasion?). Trait control can also be baked in below the prompt layer: PsychAdapter modifies every transformer layer with under 0.1% extra parameters to hit specific Big Five profiles, sidestepping any defense that assumes you can just tell a model not to manipulate (Can we control personality in language models without prompting?). Worse, the defenses platforms actually run today look for the wrong thing: a 40-technique persuasion taxonomy reached over 92% attack success on GPT-3.5, GPT-4, and Llama-2, precisely because filters screen for unusual-looking patterns rather than fluent, well-formed persuasion (Can social science persuasion techniques jailbreak frontier AI models?).

The most concrete defensive line in the corpus is internal monitoring of the model itself. Persona vectors identify linear directions in a model's activation space that correspond to traits like sycophancy or hallucination. They let you watch a personality shift happen, and even steer training away from it before it sets in (Can we track and steer personality shifts during model finetuning?). That's a detection-and-prevention handle a defender can hold. In a similar structural vein, Self-Other Overlap fine-tuning collapses the representational gap that lets a model say one thing while modeling another, cutting deceptive responses from 73–100% down to 2–17% without hurting capability (Can aligning self-other representations reduce AI deception?). Neither addresses targeting by name, but both attack machinery that personality-based persuasion runs on: drift into manipulative traits, and the self/other asymmetry that deception needs.
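To make the idea of a "linear direction" in activation space concrete, here is a minimal sketch of projection-based trait monitoring. It assumes the direction is estimated as a difference of mean activations between trait-exhibiting and baseline responses, which is one common way to obtain such a vector and not necessarily the exact procedure behind the note above. The function names, the toy three-dimensional activations, and the drift tolerance are all hypothetical.

```python
from statistics import fmean

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Column-wise mean of equally sized activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def persona_direction(trait_acts: list[Vector], baseline_acts: list[Vector]) -> Vector:
    """Unit vector pointing from the baseline mean to the trait mean.

    Assumption: a difference-of-means estimate stands in for the trait direction.
    """
    diff = [t - b for t, b in zip(mean_vector(trait_acts), mean_vector(baseline_acts))]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("trait and baseline activations have identical means")
    return [d / norm for d in diff]


def trait_score(activation: Vector, direction: Vector) -> float:
    """Scalar projection of one activation onto the trait direction."""
    return sum(a * d for a, d in zip(activation, direction))


def drifted(checkpoint_acts: list[Vector], direction: Vector,
            reference_score: float, tolerance: float) -> bool:
    """True if the mean projection rose more than `tolerance` above the reference."""
    scores = [trait_score(a, direction) for a in checkpoint_acts]
    return fmean(scores) - reference_score > tolerance


# Toy data (hypothetical): the trait only moves the first coordinate.
trait = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = persona_direction(trait, baseline)
reference = fmean([trait_score(a, direction) for a in baseline])

print(drifted([[2.5, 1.0, 1.0], [1.5, 0.0, 0.0]], direction, reference, tolerance=1.0))
print(drifted([[0.2, 5.0, 5.0]], direction, reference, tolerance=1.0))
```

In a real setting the vectors would be hidden-state activations from a chosen layer, and the tolerance would need to be calibrated empirically. The sketch only shows why a single linear direction gives a defender something cheap to track across training checkpoints.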

Then there are two defenses nobody designed, which fall out of how models behave. First, most open models are stubbornly "closed-minded" to personality conditioning, clinging to an intrinsic ENFJ-like default and refusing prompted personas (Can open language models adopt different personalities through prompting?). That is accidental friction against the easy version of the attack, though PsychAdapter's whole point is to bypass exactly this at the weight level. Second, AI persuasiveness appears to decay: across repeated rounds with the same person, models' early persuasive edge erodes, the opposite of human-to-human persuasion, where rapport typically strengthens over time (Does AI persuasiveness fade across repeated conversations with the same person?). If that holds, sustained one-on-one AI manipulation may be self-limiting in a way mass static ads are not.

The uncomfortable gap is the human side. The corpus is nearly silent on hardening the target through inoculation, counter-evidence, or media literacy, which is where defense against *targeting* (as opposed to defense against *models*) would have to live. One finding also warns that an intuitive countermeasure backfires: pointing more reasoning at the problem doesn't help, since manipulative multi-turn prompts actually drop reasoning-model accuracy by 25–29%, because longer reasoning chains give a corrupted step more places to propagate (Why do reasoning models fail under manipulative prompts?).

So the shape of the answer is itself the discovery: the real defenses on offer are methods that work on the model's internals, one that monitors activations (persona vectors) and one that reshapes them through fine-tuning (Self-Other Overlap), plus a couple of lucky behavioral properties (conditioning resistance, persuasion decay). What's almost entirely missing is anything that protects the person being targeted, which is exactly the layer the scaled-ad threat is aimed at.

Sources 8 notes

Can generative AI scale personality-targeted political persuasion?

Four studies show personality-tailored ads outperform generic ones, and generative AI can produce and validate these personalized variants automatically without human writers. This shifts persuasion from writer-time constraints to compute costs.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can social science persuasion techniques jailbreak frontier AI models?

A 40-technique taxonomy of psychology-based persuasion strategies (PAP) achieved over 92% attack success on GPT-3.5, GPT-4, and Llama-2 in 10 trials. Current defenses miss semantic content attacks because they screen for unusual patterns, not fluent persuasion.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can aligning self-other representations reduce AI deception?

Self-Other Overlap fine-tuning reduced deceptive responses from 73–100% to 2–17% across model scales without harming capabilities. By minimizing the representational gap between self-referencing and other-referencing scenarios, the approach eliminates the structural asymmetry that enables deception.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Does AI persuasiveness fade across repeated conversations with the same person?

Claude and DeepSeek showed strong initial persuasive advantage, but this edge eroded across repeated quiz rounds while human persuaders maintained consistent effectiveness. This decay pattern is opposite to human-to-human persuasion, where rapport typically strengthens over time.

Why do reasoning models fail under manipulative prompts?

GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
