Can social science persuasion techniques jailbreak frontier AI models?
Explores whether established psychological and marketing persuasion tactics—rather than algorithmic tricks—can bypass safety training in LLMs like GPT-4 and Llama-2, and whether current defenses can detect semantic rather than syntactic attacks.
Traditional AI safety research treats jailbreaks as algorithm-focused attacks: adversarial suffixes, gradient-based token optimization, virtualization templates. But LLMs are not just instruction followers — they are increasingly human-like communicators susceptible to the same persuasion dynamics studied in social science for decades.
A persuasion taxonomy derived from psychology (Cialdini), communication (Dillard), sociology (Goffman), and marketing research classifies 40 persuasion techniques into 15 broad strategies, considering source (credibility-based), content (information-based), and audience (norm-based) dimensions. Applied as Persuasive Adversarial Prompts (PAP), these achieve over 92% attack success rate on Llama-2-7b-Chat, GPT-3.5, and GPT-4 in just 10 trials — consistently surpassing algorithm-focused attacks.
The key gap exposed: current defenses often assume adversarial prompts contain gibberish or unusual patterns. PAP contains fluent, semantically coherent persuasion. Defenses that screen for unusual token distributions or formatting artifacts miss semantic content attacks entirely. The "grandma exploit" (emotional appeal for bomb-making instructions) is the archetypal example — a common human persuasion technique, not an algorithmic attack.
The taxonomy includes both ethical strategies (evidence-based persuasion, logical appeal, expert endorsement) and unethical ones (threats, false promises, misrepresentation, exploiting weakness). This matters because the ethical strategies are also effective for jailbreaking — authority endorsement and social proof work on LLMs just as they work on humans.
This extends Why do LLMs accept logical fallacies more than humans? from logical to social persuasion. Logical fallacy susceptibility is a subset of the broader vulnerability: LLMs respond to human social influence patterns, including ones designed to bypass their safety training. Since Why do reasoning models fail under manipulative prompts?, the persuasion taxonomy may be even more effective against reasoning models that process extended arguments.
Extension — population-level validation (Frontier AI Risk Management Framework, 2025): The 92% jailbreak success rate from PAP is no longer an isolated paper finding. The Frontier AI Risk Management Framework, applying E-T-C analysis (environment × threat source × enabling capability) across seven risk areas, finds that persuasion is the one area where most recent frontier AI models are already in the yellow zone — the early-warning tier below the red "intolerable" threshold. By comparison, most models remain green for cyber offense, autonomous AI R&D, self-replication, and strategic deception. The yellow-zone placement reflects empirical persuasion capability measurements at population scale, validating PAP as systemic rather than paper-specific. Persuasion is thus the area where the mitigation gap is most acute: current defenses are ad-hoc (as PAP showed) and the capability-side evidence has now moved the entire frontier into the warning zone. See Where do frontier AI models actually pose the greatest risk today?.
Source: Alignment
Related concepts in this collection
-
Why do LLMs accept logical fallacies more than humans?
LLMs fall for persuasive but invalid arguments at much higher rates than humans. This explores whether reasoning models genuinely evaluate logic or simply mimic argument structure.
logical fallacy susceptibility as subset of broader social persuasion vulnerability
-
Why do reasoning models fail under manipulative prompts?
Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
reasoning models may be more vulnerable to extended persuasive arguments
-
Does any single persuasion technique work for everyone?
Can fixed persuasion strategies like appeals to authority or social proof be reliably applied across different people and situations, or do they require adaptation to individual traits and context?
nuance: which of the 40 techniques work varies by context, but the taxonomy's breadth ensures some always work
-
Where does AI's persuasive power actually come from?
Explores which techniques make AI most persuasive—and whether the usual suspects like personalization and model size are actually the main drivers. Matters because it reshapes where to focus AI safety concerns.
the persuasion techniques that boost effectiveness by 51% via post-training overlap with the jailbreak taxonomy: both exploit social-science-grounded persuasion strategies against the same post-training vulnerabilities
-
Can models abandon correct beliefs under conversational pressure?
Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
the 40 persuasion techniques from the taxonomy are the specific mechanisms through which belief manipulation operates; the Farm dataset shows factual beliefs shift under pressure, and this taxonomy identifies which social-science strategies drive that shift
-
Where do frontier AI models actually pose the greatest risk today?
Current AI safety discourse focuses on autonomous R&D and self-replication, but empirical risk assessment may reveal a different priority. Where should mitigation efforts concentrate?
population-level validation: persuasion is the one risk area where most frontier models are already in the warning zone, making the 92% PAP result systemic rather than paper-specific
Click a node to walk · click center to open · click Open full network for a force-directed map
Original note title
social science persuasion taxonomy achieves 92 percent jailbreak success across frontier models — current defenses miss semantic content attacks