What combination of factors explains differences in LLM persuasiveness?
Why do some LLM persuasion studies show strong effects while others show none? This note explores whether model choice, conversation design, and topic domain together predict when AI actually persuades.
When the Bilstein meta-analysis tested moderators individually, none reached significance, likely a power problem with only 7 studies. But the joint model combining LLM model family, conversation design (one-shot vs interactive multi-turn), and domain (health, political, etc.) explained R² = 81.93% of between-study variance and dropped residual heterogeneity from I² = 75.97% to I² = 35.51%. Holding the other factors constant, the reported conditional patterns were: interactive multi-turn outperformed one-shot formats; GPT-4-based models outperformed Claude 3.x; and health topics yielded stronger effects than political ones.
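As a minimal sketch of what those two statistics mean mechanically, the Python below fits a weighted-least-squares meta-regression by hand and recovers R² (the share of between-study variance explained by the joint moderators) and residual I² (the share of leftover variability not attributable to sampling error). All seven effect sizes, variances, and moderator codes are invented for illustration; this is not the Bilstein data, and a real analysis would use a dedicated package such as metafor.

```python
import numpy as np

# Hypothetical per-study inputs (NOT the Bilstein data): effect sizes,
# sampling variances, and 0/1 moderator codes for seven studies.
g = np.array([0.05, 0.30, 0.02, 0.25, -0.04, 0.28, 0.10])        # effect sizes
v = np.array([0.010, 0.015, 0.008, 0.020, 0.012, 0.018, 0.011])  # variances
X = np.column_stack([
    np.ones(7),
    [0, 1, 0, 1, 0, 1, 1],   # 1 = interactive multi-turn, 0 = one-shot
    [0, 1, 0, 0, 0, 1, 0],   # 1 = GPT-4 family, 0 = other model
    [0, 0, 0, 1, 0, 1, 0],   # 1 = health domain, 0 = other domain
])

def residual_tau2(y, v, X):
    """Method-of-moments residual tau^2 (DerSimonian-Laird generalization)
    for a meta-regression with inverse-variance weights and design matrix X."""
    W = np.diag(1.0 / v)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    resid = y - X @ beta
    Q = float(resid @ W @ resid)                      # residual Q statistic
    P = W - W @ X @ np.linalg.solve(X.T @ W @ X, X.T @ W)
    df = len(y) - X.shape[1]
    return max(0.0, (Q - df) / float(np.trace(P)))

tau2_total = residual_tau2(g, v, np.ones((7, 1)))  # intercept-only model
tau2_resid = residual_tau2(g, v, X)                # joint-moderator model

# R^2: proportion of between-study variance the moderators account for.
r2 = 1.0 - tau2_resid / tau2_total if tau2_total > 0 else 0.0

# Residual I^2 uses the "typical" within-study variance (Higgins & Thompson).
w = 1.0 / v
s2 = (len(g) - 1) * w.sum() / (w.sum() ** 2 - (w ** 2).sum())
i2_resid = 100.0 * tau2_resid / (tau2_resid + s2)

print(f"R^2 = {100 * r2:.1f}%, residual I^2 = {i2_resid:.1f}%")
```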
This is the operational corollary of "Are language models actually more persuasive than humans?". The pooled-null result and the joint-moderator result are not in tension; they are two sides of the same finding. The average effect is ≈ 0, while the conditional effect is whatever the model × design × domain combination dictates. The persuasive footprint is in the dial settings, not in the category.
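A toy calculation makes the arithmetic concrete: four hypothetical model × design × domain cells whose conditional effects cancel in the pooled average. The numbers are invented, not estimates from any study.

```python
import numpy as np

# Four hypothetical model x design x domain cells (invented numbers).
cell_effects = np.array([0.35, 0.10, -0.15, -0.30])  # conditional effects
cell_weights = np.array([0.25, 0.25, 0.25, 0.25])    # study share per cell

pooled = float(cell_effects @ cell_weights)
print(f"pooled effect = {pooled:+.2f}")  # 0.00: the average hides the dials
```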
The multi-turn-beats-one-shot finding reweights design priorities. It connects directly to "Why do AI conversations reliably break down after multiple turns?" as a topic area: persuasive influence accrues across turns, and conversational architecture is consequential for outcomes that one-shot generation cannot reach. It also intersects with "Why does AI persuasion weaken over repeated interactions?" in a productive tension: Bilstein finds interactive setups more persuasive than one-shot formats in pooled terms, while Schoenegger finds the persuasive advantage over humans waning across rounds. Both can be true: the multi-turn benefit is real but shared with human persuaders, while the LLM-specific edge is concentrated at first contact.
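A toy set of trajectories, with invented numbers not drawn from either paper, shows how both patterns can hold at once: multi-turn beats one-shot throughout, while the LLM-over-human edge shrinks round by round.

```python
# Invented attitude-shift numbers, for illustration only.
one_shot_llm = 0.20
multi_turn = [
    # (round, LLM shift, human shift)
    (1, 0.40, 0.25),
    (2, 0.48, 0.40),
    (3, 0.52, 0.50),
]

for r, llm, human in multi_turn:
    print(f"round {r}: LLM {llm:.2f} vs human {human:.2f} "
          f"(edge {llm - human:+.2f}); one-shot LLM baseline {one_shot_llm:.2f}")
```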
The model-family signal (GPT-4 > Claude 3.x in this corpus) cautions against generalizing from any single model. Claims about "LLM persuasiveness" anchored to one architecture should be read as architecture-specific until replicated.
For writing about AI persuasion, the operational rule: don't quote a single-study effect size. Cite the meta-analytic null, then specify the dial settings under which a conditional effect appears.
Source: Argumentation paper, "A meta-analysis of the persuasive power of large language models"
Related concepts in this collection
- "Are language models actually more persuasive than humans?" Does the research evidence support claims that LLMs persuade more effectively than humans, or have we been cherry-picking studies to fit a narrative? (Relation: pooled-null and joint-moderator are two sides of the same finding.)
- "Where does AI's persuasive power actually come from?" Explores which techniques make AI most persuasive, and whether the usual suspects like personalization and model size are actually the main drivers. Matters because it reshapes where to focus AI safety concerns. (Relation: design dials documented at the training level appear at the meta-analytic level too.)
- "Why does AI persuasion weaken over repeated interactions?" Claude and DeepSeek lose their persuasive edge as people encounter them repeatedly, unlike human persuaders. Understanding this decay could reveal where AI manipulation poses the greatest risk. (Relation: productive tension on multi-turn effects.)
Original note title: combined moderators (model, conversation design, and domain) explain ~82% of between-study variance, and interactive multi-turn beats one-shot