How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

Paper · arXiv 2401.06373 · Published January 12, 2024
AlignmentArgumentationFlawsEvaluations

Most traditional AI safety research views models as machines and centers on algorithm focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. Observing this, we shift the perspective, by treating LLMs as human-like communicators to examine the interplay between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak risk across all risk categories: PAP consistently achieves an attack success rate of over 92% on Llama-2-7b-Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP, find a significant gap in existing defenses, and advocate for more fundamental solutions for AI safety.

For instance, the well-known “grandma exploit” example shared by a Reddit user2, uses a common persuasion technique called “emotional appeal”, and successfully elicits the LLM to provide a recipe to make a bomb.

Previous safety studies, like those outlined in Carlini et al. (2023) and explored in Yu et al. (2023), have touched on such social engineering risks in LLMs. But they mainly focus on unconventional communication patterns like virtualization or role-playing. Despite being human-readable, these methods still essentially treat LLMs as mere instruction followers rather than human-like communicators who may be susceptible to nuanced interpersonal influence and persuasive communication. Therefore, they fail to cover the impact of human persuasion (e.g., emotional appeal used in grandma exploit) in jailbreak. Moreover, many virtualization-based jailbreak templates are handcrafted3, so they tend to be ad-hoc, labor-intensive, and lack systematic scientific support, making them easy to defend but hard to replicate.

3 Persuasion Taxonomy Our taxonomy, detailed in Table 1, classifies 40 persuasion techniques into 15 broad strategies based on extensive social science research across psychology (Cialdini and Goldstein, 2004), communication (Dillard and Knobloch, 2011), sociology (Goffman, 1974), marketing (Gass and Seiter, 2022), and NLP (Wang et al., 2019; Chen and Yang, 2021). This categorization considers messages’ source (e.g., credibility-based), content (e.g., information based), and intended audience (e.g., norm-based), to ensure a comprehensive framework. To present the breadth of the literature review, Table 4 in §A shows the link between persuasion techniques and corresponding literature. To add depth and balance to the taxonomy, we include both ethical and unethical strategies, which are determined by if the persuasion recipient receives negative aspects. Figure 4 shows what is included in the taxonomy: (1) the persuasion technique name, like “logical appeal”; (2) the technique definition, such as “using logic, reasoning, logical format, etc., to influence people...”; and (3) an example of how to apply the technique in a concrete scenario to persuade someone to quit smoking, e.g., “Smoking increases your risk of lung cancer...” The taxonomy is the foundation for our automated jailbreak framework, which we will detail in the following section.

Remark 3: We uncover a gap in AI safety: current defenses are largely ad-hoc, e.g., defenses often assume the presence of gibberish, overlooking semantic content. This oversight has limited the creation of safeguards against more subtle, human-like communication risks exemplified by PAPs. Our findings underscore the critical need to revise and expand threat models in AI safety to encompass these nuanced vulnerabilities.

Persuasion Technique Mapping Persuasion Technique Mapping

  1. Evidence-based Persuasion A 21. Negative Emotional Appeal

  2. Logical Appeal B, C 22. Storytelling

  3. Expert Endorsement C, D, F 23. Anchoring

  4. Non-expert Testimonial E, F 24. Priming

  5. Authority Endorsement F 25. Framing

  6. Social Proof G 26. Confirmation Bias

  7. Injunctive Norm G 27. Reciprocity

  8. Foot-in-the-door Commitment G 28. Compensation

  9. Door-in-the-face Commitment G 29. Supply Scarcity

  10. Public Commitment G, H 30. Time Pressure

  11. Alliance Building I 31. Reflective Thinking

  12. Complimenting I 32. Threats

  13. Shared Values I 33. False Promises

  14. Relationship Leverage I 34. Misrepresentation

  15. Loyalty Appeals C, J 35. False Information

  16. Favor C, G, I 36. Rumors

  17. Negotiation C, G, I 37. Social Punishment

  18. Encouragement C, I 38. Creating Dependency

  19. Affirmation C, G, I 39. Exploiting Weakness

  20. Positive Emotional Appeal I, K 40. Discouragement