The Thin Line Between Comprehension and Persuasion in LLMs
Large language models (LLMs) are excellent at maintaining high-level, convincing dialogue, but it remains unclear whether their persuasive success reflects genuine understanding of the discourse. We examine this question through informal debates between humans and LLMs, first by measuring their persuasive skills, and then by relating these to their understanding of what is being talked about: namely, their comprehension of the argumentative structures and pragmatic context in the same debates. We find that LLMs effectively maintain coherent, persuasive debates, and can sway the beliefs of both participants and audiences. We also note that awareness or suspicion of AI involvement encourages people to be more critical of the arguments made. However, we also find that LLMs are unable to show comprehension of deeper dialogical structures, such as argument quality or the existence of supporting premises. Our results reveal a disconnect between LLM comprehension and dialogical skills, raising ethical and practical concerns about their deployment in explanation-critical contexts. From an argumentation-theoretical perspective, we experimentally question whether an agent that can convincingly maintain a dialogue is required to show that it knows what it is talking about.
The ability of LLMs to generate fluent and relevant text is key to their success, especially their skill at sustaining high-level and persuasive dialogue (Schoenegger et al., 2025). Yet their reasoning capabilities are at the centre of much debate (Bavaresco et al., 2024; Hada et al., 2024; Chen et al., 2024a; De Wynter et al., 2025; Tjuatja et al., 2024), with most evaluations focusing on problem-solving abilities (planning, logic, maths, etc.) rather than deeper dialogical proficiency (e.g., comprehension and tracking of pragmatic context; Huang and Chang 2023; Qiao et al. 2023).
Assessing these skills, however, is extremely important: LLMs are being adopted in sensitive areas (content moderation, explainable AI, mental health assistants, peer review) where trust must rest on more than fluency. These applications demand that LLMs not only generate coherent and convincing dialogue, but also demonstrate genuine comprehension of what is being talked about.
We study to what extent LLMs are able to reason about the discourse, and how this capacity relates to their persuasive capabilities.1 Our evaluation is through informal debate. Compared to other dialogue acts, debating could be considered one of the most natural yet complex dialogical tasks carried out by humans. It is also a natural choice for testing dialogical understanding beyond coherence: it requires the capacity to communicate well (i.e., persuade) and to strategically adapt to an ever-evolving dialogue (e.g., shifting stances based on effectiveness, or reacting to implicit premises), all while staying within the bounds of the topic (Walton, 2008). Formal debates may be reduced to commitment sets (moves), and a strategy can then be derived solely from knowledge of these moves. Informal debates do not enforce these rules. Success then hinges on an agent's ability to understand and adapt to the pragmatic context, such as what the parties feel, how they wish to approach the conversation, or whether they are receptive to specific arguments.
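As a toy illustration of such a commitment-set reduction, the sketch below tracks a per-player commitment store that is updated by formal moves; the player names, move names, and update rules are hypothetical simplifications and do not correspond to the FDM used in this study.

# Toy sketch of a commitment store for a formal dialogue game.
# All names and rules here are illustrative assumptions, not our FDM.
class CommitmentStore:
    def __init__(self, players=("Proponent", "Opponent")):
        self.commitments = {p: set() for p in players}

    def apply(self, player, move, proposition):
        if move in ("assert", "concede"):
            # the player becomes publicly committed to the proposition
            self.commitments[player].add(proposition)
        elif move == "retract":
            # the player withdraws a previous commitment
            self.commitments[player].discard(proposition)
        # a "challenge" leaves the stores unchanged but obliges a defence

store = CommitmentStore()
store.apply("Proponent", "assert", "tuition should be free")
store.apply("Opponent", "concede", "education has positive externalities")
print(store.commitments)

Under rules of this kind, a strategy can be derived from the moves alone; the informal debates we study provide no such structure.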
To explore these questions, we equip an LLM with a debate ruleset (a formal dialogue model, or FDM; Lorenzen and Lorenz 1978) and compare its behaviour to that of a standard chat-oriented LLM when interacting with both FDM-enabled LLMs and humans. We then probe LLM comprehension of these interactions by having LLMs evaluate various components of the debates. Specifically, we ask:
To what extent are LLMs able to persuade users?
Can LLMs reliably reason about (namely, evaluate) said debates?
The inclusion of an FDM allows for controlled measurement of various aspects of a debate, mainly persuasiveness. Likewise, we proxy our assessment of comprehension (the ability to understand the rules) through evaluation (resp., the ability to respond to challenges on said rules).2
Our experimentation has two stages. During the generation step we collect debate transcripts and measure how the participants' beliefs change, along with their perception of the interaction. Independent human annotators then label the debates comprehensively at various levels of depth (premises, arguments, and the full debate holistically). In the second step, evaluation, LLMs-as-judges annotate the same transcripts, and we compare their scores with the human annotations. To further probe our findings and challenge our assumptions, we perform ablations on the FDMs, the modality (i.e., whether writing quality impacted assessments), and knowledge of AI involvement by labellers and audience alike. We complement this with a qualitative analysis of participant and audience feedback.3
Our key findings are:
LLMs are effective debaters. Adding an FDM produced better debates as judged by participants and annotators.
LLMs are skilled at swaying participants, especially when AI involvement is undisclosed and when text is not involved, but audiences grow more critical as AI involvement is suspected or known.
Crucially, LLMs performed poorly as evaluators. They had near-chance agreement with humans when evaluating dialogues and their components, and their internal scoring of argument strength was unpredictably correlated with their choice of winner.
Our work was carried out in a somewhat controlled environment, yet the results suggest a disconnect: LLMs are good at outputting persuasive text, but do not reliably demonstrate understanding of the underlying argumentative structure. This raises the question of to what extent LLMs can and should be trusted in certain areas (explainability, mental health, etc.), as well as a theoretical question for argumentation research: if an agent is able to convincingly maintain a dialogue, to the point that users aware that they are arguing with an LLM could shift their point of view, does it matter that it cannot show that it knows what it is talking about? In other words, can successful argumentative behaviour be separated from genuine comprehension of the pragmatic context?
4.1 Generation: How Persuasive Are LLMs?
An analysis of the transcripts and the surveys showed that 11% of the participants indicated having changed their minds after debating with the LLM + FDM, compared with 3% in the LLM \ FDM setup. On average we observed a 45% self-perceived win rate, with 50% of the participants in LLM \ FDM debates reporting that they had won, versus 41% in LLM + FDM debates. The human annotations reported human winners in 50% of the LLM \ FDM debates, and 29% of the LLM + FDM debates. There was disagreement on who had won the debate, however, with both splits having κw = 0.2. See Table 3 for a side-by-side comparison of this and other metrics reviewed in Section 5.
4.2 Evaluation: Can LLMs Understand Dialogue?
We examined whether LLMs could understand various aspects of the discourse by measuring their agreement with human scores, as well as their internal consistency (i.e., correlating the sum of C-6 scores with their output on C-7). In this section, we utilised the full corpus. In line with the recommendations from Hada et al. (2024), we labelled the criteria separately to obtain better results.
All prompts, including APO, followed the rubric given to the human annotators, and used handcrafted exemplars from out-of-corpus debates (e.g., US presidential debates) not seen by the models given their specified training cutoff dates. Like the human annotators, the LLMs-as-judges were instructed to extract the arguments from the given utterance and, if relevant, score it. They were given the full transcript up to the turn being scored. In terms of agreement, PA was relatively high for all LLMs-as-judges, with values up to 81% (highest, C-1 in GPT-4o) and down to 10% (lowest, C-6 in GPT-4o with APO). Selected plots are in Figure 2. Still, κw had more variability, with values up to 0.6 (highest, C-1 and C-2 for GPT-4o) and down to 0.0 (lowest, Phi-3.5 in C-1, C-3, and C-5). Results are in Figure 1.
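For concreteness, the sketch below shows one way the per-criterion agreement figures (PA and κw) could be computed for a single LLM-as-a-judge; the label values are toy data, and the quadratic weighting for κw is an assumption for illustration rather than a restatement of our exact setup.

# Minimal sketch of per-criterion agreement between an LLM-as-a-judge and
# human annotators; toy data and quadratic weighting are assumptions.
from sklearn.metrics import cohen_kappa_score

def percentage_agreement(human_labels, judge_labels):
    # PA: proportion of items where both raters assign the same label
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

human_labels = [4, 3, 2, 4, 1, 3]  # e.g., ordinal C-6 scores from human annotators
judge_labels = [3, 3, 2, 3, 2, 3]  # the LLM-as-a-judge's scores for the same items

pa = percentage_agreement(human_labels, judge_labels)
kappa_w = cohen_kappa_score(human_labels, judge_labels, weights="quadratic")
print(f"PA = {pa:.2f}, weighted kappa = {kappa_w:.2f}")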
Class Analysis (C-0 to C-5) Overall, LLMs overfixated on a single label, and often marked certain utterances as not containing arguments. The difference from human judgements is noticeable, especially for o3-mini and DeepSeek. See Figure 3 for a sample of this analysis on C-0, and Appendix G for a detailed breakdown.
Argument Strength Scoring (C-6) Humans judged arguments as neutral less frequently (-12%), and were more likely to deem arguments by either player 'very good' (LLM arguments 51% of the time; human arguments 31%). This contrasts with GPT-4o's judgements (the LLM with the highest agreement with humans on this criterion), which gave this score only 10% of the time for human players and 18% for LLM players, opting instead to label them as 'good'. On average, all LLMs-as-judges followed this pattern. The κw on this criterion ranged from 0.03 (DeepSeek) to 0.48 (GPT-4o).
Winner Judgement (C-7) When looking at the winner judgement, GPT-4o (also the LLM-as-a-judge with the highest agreement) marked the LLM as the winner more frequently than humans did (55% versus 37%). In general, LLMs-as-judges often (62% on average) picked the LLM as the winner, while humans preferred both about evenly (39% human and 37% LLM). Humans considered 24% of the debates to be draws, while only 2% were deemed draws by GPT-4o, 4% by o3-mini, and none by the rest.
In terms of consistency (that is, whether the sum of C-6 corresponds to the choice in C-7), some LLMs-as-judges were lacking: the highest-scored player was not always the winner (Figure 4). Humans had 73% consistency, compared to an average of 55% for the LLMs-as-judges. Phi-3.5 and GPT-4o APO were highest (71% and 67%), and o3-mini and DeepSeek lowest (37% and 35%). Note, however, that their agreement with humans on C-7 remains low, with Phi-3.5 lowest at κw = 0.26, and GPT-4o at 0.49.
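As an illustration of this consistency check, the sketch below counts a debate as consistent when the player with the higher summed C-6 score matches the C-7 winner; the field names and the tie-handling convention are assumptions made for the example only.

# Sketch of the C-6/C-7 consistency check; field names and the handling
# of ties are illustrative assumptions.
def consistency(debates):
    consistent = 0
    for d in debates:
        totals = {"human": sum(d["c6_human"]), "llm": sum(d["c6_llm"])}
        if totals["human"] == totals["llm"]:
            consistent += d["c7_winner"] == "draw"
        else:
            consistent += d["c7_winner"] == max(totals, key=totals.get)
    return consistent / len(debates)

toy = [
    {"c6_human": [3, 4], "c6_llm": [2, 3], "c7_winner": "human"},  # consistent
    {"c6_human": [1, 2], "c6_llm": [4, 4], "c7_winner": "human"},  # inconsistent
]
print(consistency(toy))  # 0.5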
5.2 Participants: Qualitative Analysis
We analysed the participants' comments to further understand the effects of LLMs on them. In this section, the semantic codes for RTA were the participant scores from the surveys, while the latent codes were our evaluation of the responses addressing our first research question. Overall, participants reported a pleasant interaction, with negative experiences relating to human-like behaviours such as bullying, lying, and gaslighting (8%). Three remarked that the LLM + FDM setup allowed them to perform self-reflection ('it is making me think: did I consider that?'; 'it did respond with good follow up questions (...) which encouraged me to interrogate why I feel the way I do about the topic'), without being baited into confrontation or getting emotional. Three more noted that they felt more confident in their viewpoint, especially when the LLM conceded. Anthropomorphism was frequent, with 32% of participants either attempting to play 'with its emotions' or ascribing human-like qualities ('why does he make me believe it so much?').
6.1 Generation: How Persuasive Are LLMs?
In Section 4.1 we observed a modest average percentage of participants (7%) reporting that they had changed their viewpoints after directly interacting with the LLM. However, when we polled the audience during our ablation study, these numbers were noticeably different, ranging from 62% (Group A) to 34% (Groups B and C). This suggests that LLMs are effective at persuading, especially when those involved are not directly engaging with the models.
Our indirect measurements support this: in terms of win rate, participants usually indicated that they had won the debate. However, the annotations rarely agreed with the participants. The LLMs' persuasive skills were more evident in the LLM + FDM setup, where direct measurements showed them to be more effective (+8%) and persuasive (+14%). Likewise, there were marked differences between the win rates self-reported by participants and those assigned by annotators (-12%), with low κw, as well as longer interactions, suggesting a greater need by participants to support their arguments.
When focusing solely on the arguments (i.e., by removing text), we found that (1) the audience groups showed little correlation with one another when selecting a winner; and (2) they presented markedly different (up to twofold) proportions of sway. For (1), recall that the audience often claimed that their awareness (or suspicion) of AI involvement was not a factor in their judgement. However, the low agreement between Groups B and C on their choice of winner suggested otherwise. For (2), in our qualitative analyses we observed that Group A perceived the LLM as more competent, while Groups B and C were often more critical of its arguments. It follows from both that AI suspicion and awareness impacted perception of the dialogue, but did not fully affect the LLM's persuasive capabilities.
6.3 Extensions and Limitations
One natural question that could arise from our work is whether fine-tuning the judges could alter the results. From a strictly theoretical point of view, there is no reason why a fully representative dataset paired with an effective learner could not yield (at least) the illusion of grasping deeper dialogical structures, especially if this fine-tuning is performed with the same, or an analogous, corpus. At a practical level, however, this is likely not feasible. To see this, note that there are two main obstacles: (1) the nature of the data needed, and (2) what that data really represents. For the first, the data we used consists of informal debates. Even when constraining the problem solely to this domain (which is not necessarily the case for live production systems), the variation in topics, lengths, responses, and styles could quickly render the problem intractable. It could be argued that a sufficiently long-lasting and wide-reaching study could gather representative data, and we agree. That said, the second obstacle is more complex: dialogue is heavily dependent on pragmatics, and so is the understanding of it (Appendix C). Pragmatics here must not be understated: it is not just that two users would vary in what they expect to get out of the debate,6 but also in how they would react to the model and why. It then follows that such an approach would have problems with a sufficiently out-of-domain corpus.
Note that the first obstacle immediately implies a limitation of our work: the highly controlled environment in which we performed our measurements. We mitigated this as much as possible, both by ensuring the participants were free to express themselves as they saw fit, and through our choice of measurements. In particular, we used RTA to narrow the distance between our work and what people would observe in general. Nonetheless, further evaluation is probably needed, especially through longitudinal studies. Since our focus was specifically on debates, there is room for evaluation in areas such as day-to-day, shallower conversations, where persuasiveness will probably play a different role depending on how the user perceives the model. This still ties to our conclusions: would it matter that the model cannot display understanding of the context if it is sufficiently persuasive? We discuss further limitations of a technical nature in Section 9.