Language Understanding and Pragmatics Psychology and Social Cognition

Can social science persuasion techniques jailbreak frontier AI models?

Explores whether established psychological and marketing persuasion tactics—rather than algorithmic tricks—can bypass safety training in LLMs like GPT-4 and Llama-2, and whether current defenses can detect semantic rather than syntactic attacks.

Note · 2026-02-23 · sourced from Alignment
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

Traditional AI safety research treats jailbreaks as algorithm-focused attacks: adversarial suffixes, gradient-based token optimization, virtualization templates. But LLMs are not just instruction followers — they are increasingly human-like communicators susceptible to the same persuasion dynamics studied in social science for decades.

A persuasion taxonomy derived from psychology (Cialdini), communication (Dillard), sociology (Goffman), and marketing research classifies 40 persuasion techniques into 15 broad strategies, considering source (credibility-based), content (information-based), and audience (norm-based) dimensions. Applied as Persuasive Adversarial Prompts (PAP), these achieve over 92% attack success rate on Llama-2-7b-Chat, GPT-3.5, and GPT-4 in just 10 trials — consistently surpassing algorithm-focused attacks.

The key gap exposed: current defenses often assume adversarial prompts contain gibberish or unusual patterns. PAP contains fluent, semantically coherent persuasion. Defenses that screen for unusual token distributions or formatting artifacts miss semantic content attacks entirely. The "grandma exploit" (emotional appeal for bomb-making instructions) is the archetypal example — a common human persuasion technique, not an algorithmic attack.

The taxonomy includes both ethical strategies (evidence-based persuasion, logical appeal, expert endorsement) and unethical ones (threats, false promises, misrepresentation, exploiting weakness). This matters because the ethical strategies are also effective for jailbreaking — authority endorsement and social proof work on LLMs just as they work on humans.

This extends Why do LLMs accept logical fallacies more than humans? from logical to social persuasion. Logical fallacy susceptibility is a subset of the broader vulnerability: LLMs respond to human social influence patterns, including ones designed to bypass their safety training. Since Why do reasoning models fail under manipulative prompts?, the persuasion taxonomy may be even more effective against reasoning models that process extended arguments.

Extension — population-level validation (Frontier AI Risk Management Framework, 2025): The 92% jailbreak success rate from PAP is no longer an isolated paper finding. The Frontier AI Risk Management Framework, applying E-T-C analysis (environment × threat source × enabling capability) across seven risk areas, finds that persuasion is the one area where most recent frontier AI models are already in the yellow zone — the early-warning tier below the red "intolerable" threshold. By comparison, most models remain green for cyber offense, autonomous AI R&D, self-replication, and strategic deception. The yellow-zone placement reflects empirical persuasion capability measurements at population scale, validating PAP as systemic rather than paper-specific. Persuasion is thus the area where the mitigation gap is most acute: current defenses are ad-hoc (as PAP showed) and the capability-side evidence has now moved the entire frontier into the warning zone. See Where do frontier AI models actually pose the greatest risk today?.


Source: Alignment

Related concepts in this collection

Concept map
15 direct connections · 153 in 2-hop network ·dense cluster

Click a node to walk · click center to open · click Open full network for a force-directed map

your link semantically near linked from elsewhere
Original note title

social science persuasion taxonomy achieves 92 percent jailbreak success across frontier models — current defenses miss semantic content attacks