A meta-analysis of the persuasive power of large language models
Large language models (LLMs) are increasingly used for persuasion, such as in political communication and marketing, where they affect how people think, choose, and act. Yet, empirical findings on how effective LLMs are at persuasion compared to humans remain inconsistent. The aim of this study was to systematically review and meta-analytically assess whether LLMs differ from humans in persuasive effectiveness, and under which contextual conditions LLMs are particularly effective. We identified 7 studies with 17,422 participants, primarily recruited from English-speaking countries, yielding 12 effect size estimates. Standardized effect sizes were computed as Hedges’ g. Egger’s test indicated potential small-study effects (p = .018), but the trim-and-fill analysis did not impute any missing studies, suggesting a low risk of publication bias. The results show no significant overall difference in persuasive performance between LLMs and humans (g = 0.02, p = .530). However, we observe substantial heterogeneity across studies (I² = 75.97%), suggesting that persuasiveness strongly depends on contextual factors. In separate exploratory moderator analyses, no individual factor (e.g., LLM model, conversation design, or domain) reached statistical significance, which may be due to the limited number of studies. When considered jointly in a combined model, these factors explained a large proportion of the between-study variance (R² = 81.93%), and residual heterogeneity was low (I² = 35.51%). Although based on a small number of studies, this suggests that differences in LLM model, conversation design, and domain are important contextual factors shaping persuasive performance, and that single-factor tests may understate their influence. Our results highlight that LLMs can match human performance in persuasion, but their success depends strongly on how they are implemented and embedded in communication contexts.
Large language models (LLMs) such as GPT-4 are increasingly embedded in communication settings that aim to shape attitudes, preferences, or behaviors1–3. For instance, marketing professionals use LLMs to generate persuasive product descriptions or targeted advertisements4. Political campaigns have experimented with AI-generated messaging to mobilize voters or improve communication styles5,6. In healthcare, LLMs are increasingly used for nudging healthier choices through personalized recommendations and framing techniques7, but their use is now also being explored in high-stakes clinical decision support, including assistance with disease diagnosis, triage, and treatment recommendations with potentially wide-ranging implications for patient care8–10.
These applications demonstrate the growing role of LLMs in persuasive communication, raising both promise and concern. On the one hand, LLMs offer scalable and adaptable tools for tailoring messages to individual recipients3, potentially improving the effectiveness of public information2 or customer engagement11, facilitating learning through multimodal and personalized educational support12, and supporting more constructive civic discourse13. On the other hand, the same technologies may be leveraged for manipulation, misinformation, or undue influence14–18, especially as LLMs become increasingly human-like in tone, reasoning, and responsiveness19. Given this dual potential, a key question emerges: how persuasive are LLMs? Despite a growing body of empirical studies, the evidence on the persuasive capabilities of LLMs remains inconsistent20,21. Reported effect sizes vary substantially: while some studies report that LLM-generated messages perform on par with or even better than human-written content1,2,22, others find no such advantage, with LLMs failing to outperform human communicators in direct comparisons23,24. However, we show later that these findings often stem from fundamentally different setups, including differences in model version, task format, and evaluation metric, which makes results difficult to compare across studies. As a result, there is no clear evidence for how persuasive LLMs are relative to humans. Here, we thus aim to answer the following research question: How effective are large language models at persuading humans, compared to human persuaders?
Theoretical background
Persuasion refers to the deliberate attempt to influence others’ beliefs, attitudes, or behavioral intentions through communication25. In social psychology, persuasion is considered a foundational mechanism for shaping human behavior and decision-making, with extensive research in domains such as health messaging, political advocacy, and consumer communication26,27.
Prominent theoretical models, including the Elaboration Likelihood Model (ELM), posit that persuasion operates via two distinct cognitive routes: a central route involving deliberate elaboration of message arguments and a peripheral route relying on heuristics such as source credibility or affective cues26. The likelihood of central versus peripheral processing depends on the recipient’s motivation and ability to elaborate, which, in turn, are shaped by factors such as personal relevance, prior knowledge, and contextual complexity28. The Persuasion Knowledge Model (PKM) further explains that, when recipients recognize a message as an attempt to persuade, their reactions may change accordingly, either by resisting or reinterpreting the message – which often activates the central processing route. For example, on social media, knowing an influencer is paid for a recommendation can lead followers to question the message (central route), whereas unawareness may prompt reliance on cues like attractiveness or popularity (peripheral route). While the dual-process model explains how people process persuasive messages, the model does not stipulate how processing translates into intentions and behavior.
Several theories may offer insights into how persuasion varies across individuals. The Theory of Planned Behavior29 emphasizes the role of attitudes, perceived norms, and behavioral control in shaping intentions and actions. Applied to the influencer example above, the likelihood that a follower will buy a promoted product increases when, for example, the follower views the product positively, believes that important others approve, and feels able to purchase and use it.
Together, these theoretical models highlight that persuasive effects do not stem from the message content alone, but emerge from dynamic interactions between message, communicator, and recipient. While originally developed to explain human-to-human persuasion, these theories are now increasingly used to assess whether AI-generated content—such as that produced by large language models—can trigger similar shifts in beliefs, attitudes, and behavioral intentions2.
Empirical studies in persuasion typically assess effectiveness through one or more outcome types: (i) changes in attitude (later referred to as “attitude”), (ii) shifts in behavioral intention (later referred to as “intention”), and (iii) actual behavior (later referred to as “behavior”)27,30. These outcomes are often measured using Likert-scale items or behavioral indicators such as agreement rates, policy support, or compliance decisions. Another common approach is to use perceived message effectiveness (PME) as a proxy for actual persuasion. In some cases, researchers aggregate multiple outcomes into composite indices to capture a broader construct of persuasive impact2,23,31,32. Of note, understanding how persuasion is operationalized is crucial for the later interpretation and comparison of effect sizes across studies. These measures vary not only in their scale and format but also in the underlying psychological construct they reflect, ranging from cognitive evaluations of message quality to affective or behavioral responses. These differences in outcome measures may account for part of the variability in effect sizes reported in the literature and thus inform our selection of moderators in our meta-analysis.
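As a point of reference for how outcomes measured on different scales are placed on a common metric, the standardized mean difference with small-sample correction (Hedges’ g) takes the following standard form; this is a generic sketch of the estimator, not necessarily the exact variant applied in every primary study:

$$ g = J \cdot \frac{\bar{x}_{\mathrm{LLM}} - \bar{x}_{\mathrm{human}}}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}, \qquad J = 1 - \frac{3}{4(n_1 + n_2 - 2) - 1}, $$

where $\bar{x}$, $s$, and $n$ denote the group means, standard deviations, and sample sizes of the LLM and human conditions, $s_p$ is the pooled standard deviation, and $J$ corrects the small-sample bias of Cohen’s $d$.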
LLM vs. human persuasion
To address our main research question, namely how effective LLMs are at persuading humans compared to human persuaders, we conducted a random-effects meta-analysis of all eligible studies that directly compared LLM-generated and human-generated persuasive messages. The corresponding forest plot is presented in Fig. 2. The meta-analysis reveals a negligible and statistically non-significant difference in persuasive effectiveness between the two groups. The pooled effect size is Hedges’ g = 0.02 (p = .530), with a 95% confidence interval of [−0.048, 0.093].
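To illustrate the pooling step, the following Python sketch implements a standard DerSimonian–Laird random-effects model on per-study Hedges’ g values and their sampling variances. The inputs are hypothetical placeholders rather than the values extracted from the included studies, and the actual analysis may have relied on dedicated meta-analysis software with a different between-study variance estimator:

```python
import numpy as np

def dersimonian_laird_pool(g, v):
    """Pool per-study Hedges' g values with a DerSimonian-Laird random-effects model.

    g : per-study standardized effect sizes (Hedges' g)
    v : their sampling variances
    Returns the pooled effect, its standard error, and the between-study variance tau^2.
    """
    g, v = np.asarray(g, float), np.asarray(v, float)
    w = 1.0 / v                                    # inverse-variance (fixed-effect) weights
    g_fe = np.sum(w * g) / np.sum(w)               # fixed-effect pooled estimate
    Q = np.sum(w * (g - g_fe) ** 2)                # Cochran's Q statistic
    df = len(g) - 1
    C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / C)                  # method-of-moments between-study variance
    w_re = 1.0 / (v + tau2)                        # random-effects weights
    g_re = np.sum(w_re * g) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    return g_re, se, tau2

# Hypothetical placeholder inputs for illustration only (not the extracted study data)
g = [0.15, -0.05, 0.30, 0.02, -0.10, 0.12]
v = [0.010, 0.008, 0.020, 0.005, 0.015, 0.012]
pooled, se, tau2 = dersimonian_laird_pool(g, v)
print(f"pooled g = {pooled:.3f}, "
      f"95% CI [{pooled - 1.96 * se:.3f}, {pooled + 1.96 * se:.3f}], tau^2 = {tau2:.4f}")
```

The random-effects weights shrink the influence of any single study when the between-study variance is large, so the pooled estimate and its confidence interval reflect both sampling error and genuine heterogeneity across studies.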
On average, we find insufficient evidence for a difference in persuasiveness between LLMs and humans. This finding proved robust across both the full dataset and a sensitivity analysis restricted to peer-reviewed studies, indicating that the inclusion of preprints does not bias our results (see Table 4). However, this overall result does not rule out meaningful differences between specific study contexts or subgroups. The substantial heterogeneity we found (I² = 75.97%) suggests that roughly three-quarters of the observed variance in effect sizes reflects actual differences across studies rather than random sampling error. Motivated by this, we next exploratively examine potential moderators that may help explain under which conditions LLMs outperform or underperform relative to humans.
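For reference, the I² statistic reported above expresses the share of total variability in effect sizes attributable to between-study heterogeneity rather than sampling error; in its standard form,

$$ I^2 = \max\!\left(0,\; \frac{Q - (k - 1)}{Q}\right) \times 100\%, \qquad Q = \sum_{i=1}^{k} w_i \left(g_i - \bar{g}_{\mathrm{FE}}\right)^2, \quad w_i = \frac{1}{v_i}, $$

where $k$ is the number of effect size estimates, $v_i$ are their sampling variances, and $\bar{g}_{\mathrm{FE}}$ is the fixed-effect pooled estimate.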
Our meta-analysis reveals that current evidence does not demonstrate a consistent difference in persuasiveness between LLMs and humans. This runs counter to overoptimistic narratives1,22, while also challenging skeptical perspectives that downplay the persuasive capacity of LLMs20,52. Instead, the substantial heterogeneity observed across studies suggests that persuasive effectiveness is likely conditional on contextual factors. Accordingly, we argue for a conceptual shift away from binary “AI versus human” framings and toward a more nuanced understanding of how model capabilities, communication design, and task characteristics jointly shape persuasive effects.
To better understand which design and contextual factors influence the persuasiveness of LLMs, we examined variation across studies using a model that includes all moderators jointly. That the individual moderator analyses (e.g., LLM model, conversation design, domain) did not yield consistently significant effects is likely due to the small number of studies and hence insufficient power to detect differences. However, the combined model explained a large share of the between-study variance (R² = 81.93%) and considerably reduced unexplained heterogeneity (I² = 35.51%). This indicates that contextual factors may play an important role in shaping persuasive outcomes. For example, holding other factors constant, interactive setups were more persuasive than one-shot formats, GPT-4-based models outperformed Claude 3.x, and health-related topics yielded stronger effects than political ones. However, with few studies and many predictors, the combined model may overfit the data.
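To make the structure of such a combined moderator model concrete, the sketch below fits a meta-regression by weighted least squares with dummy-coded moderators and a fixed residual between-study variance. The moderator coding, effect sizes, and tau² value are hypothetical placeholders for illustration; the actual analysis may have used dedicated meta-analysis software and a restricted maximum likelihood estimator:

```python
import numpy as np

def meta_regression(g, v, X, tau2):
    """Weighted least-squares meta-regression of effect sizes on study-level moderators.

    g    : per-study effect sizes
    v    : their sampling variances
    X    : design matrix (column of ones, then dummy-coded moderators)
    tau2 : residual between-study variance (treated as known here for simplicity)
    Returns coefficient estimates and their standard errors.
    """
    W = np.diag(1.0 / (np.asarray(v, float) + tau2))   # random-effects weights
    X = np.asarray(X, float)
    XtW = X.T @ W
    cov = np.linalg.inv(XtW @ X)                        # covariance matrix of the coefficients
    beta = cov @ XtW @ np.asarray(g, float)
    se = np.sqrt(np.diag(cov))
    return beta, se

# Hypothetical placeholder data (not the extracted study data):
# columns = intercept, interactive vs. one-shot design, health vs. political domain
g = [0.15, -0.05, 0.30, 0.02, -0.10, 0.12]
v = [0.010, 0.008, 0.020, 0.005, 0.015, 0.012]
X = [[1, 1, 1],
     [1, 0, 0],
     [1, 1, 1],
     [1, 0, 1],
     [1, 0, 0],
     [1, 1, 0]]
beta, se = meta_regression(g, v, X, tau2=0.01)
for name, b, s in zip(["intercept", "interactive design", "health domain"], beta, se):
    print(f"{name}: b = {b:.3f} (SE = {s:.3f})")
```

The R² reported for such models is typically computed as the relative reduction in the estimated between-study variance compared with an intercept-only model, which is why a large R² can coexist with non-significant individual coefficients when the number of studies is small.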
Implications
The findings of our meta-analysis point to important implications for practice, society, and research. For practitioners (e.g., in marketing), they suggest that the persuasive effectiveness of LLMs cannot be taken for granted but must be evaluated in light of specific use cases. Multi-turn conversational interactions may offer advantages over one-shot prompts in contexts where persuasive influence is the goal, such as marketing, political messaging, or customer engagement22,53. In such interactive contexts, LLMs could draw on strengths such as the ability to personalize messages, provide evidence-based arguments, and reason coherently1,2,22,32.
Conversely, simpler LLMs or static one-shot deployments may underperform and even lead to unintended outcomes. Recent surveys in computational persuasion warn that minimally contextual or one-shot approaches are especially prone to bias, adversarial manipulation, and context insensitivity, which can reduce effectiveness or backfire in persuasive use cases19. Furthermore, one-shot message formats may expose limitations of LLMs seen in other domains, such as creative writing, where outputs are often less original than those of humans34,54.
A similar pattern may hold across domains, as the effectiveness of LLMs may depend on the topic discussed. Across the reviewed studies, human-generated messages were typically more emotionally vivid and personally engaging, whereas LLM-generated texts relied more on analytical reasoning and informational coherence1,2,17,22,31. LLMs may thus be particularly effective in domains where fact-based reasoning and logical elaboration are central to persuasion, while they may be less effective when emotional resonance, empathy, or narrative authenticity are required.
From a societal and ethical perspective, the findings raise concerns about the responsible use of persuasive LLMs1. The mechanisms that might contribute to persuasive effectiveness, such as the specific conversation design, can equally be leveraged to manipulate, deceive, or exploit users. This risk is particularly salient in high-stakes domains like political communication and misinformation55 or when LLMs are used as chatbots for mental health support, where dynamic engagement may bypass safety frameworks put in place by LLM developers. As LLMs increasingly approximate human-like tone and reasoning56,57, the boundary between legitimate influence and manipulation becomes ever harder to delineate58. A further emerging area of concern is healthcare, where LLMs are increasingly explored for diagnostic reasoning, triage, and treatment recommendations8–10; in such settings, persuasive or overly confident outputs could anchor clinicians on incorrect decisions and pose direct risks to patient safety. This highlights the need for robust ethical guidelines59–62 that define acceptable forms of AI-mediated persuasion. Developers and regulators should not only focus on technical capabilities, but also consider transparency of intent, user autonomy, and the design of interaction structures that prevent coercive or deceptive practices.
From a theoretical perspective, our findings inform the conceptual understanding of both human–AI collaboration and persuasion research. In human–AI collaboration theory, the observed dependence of persuasive effects on model capabilities, communication format, and domain aligns with frameworks emphasizing that humans and LLMs possess complementary strengths, suggesting that effective persuasion may arise when these distinct abilities are combined63,64. Qualitatively, synthesizing the included studies, this heterogeneity also aligns with the Elaboration Likelihood Model of persuasion: LLMs might show potential along the central route, which relies on analytical processing, whereas human communicators retain strengths along the peripheral route, which depends on emotional, relational, and identity-based cues26. Moreover, LLMs offer novel opportunities for theory testing65 by enabling systematic manipulation of message characteristics in controlled, large-scale environments. Such setups allow researchers to replicate classic persuasion experiments or explore theoretical mechanisms at a scale and level of experimental control that would be difficult to achieve with human participants alone66–68. By identifying conditional factors rather than relying on binary human-versus-AI comparisons, our study contributes to a more nuanced understanding of how, when, and why LLMs can be persuasive.