Can Language Models Recognize Convincing Arguments?
We propose tasks measuring LLMs' ability to (1) distinguish between strong and weak arguments, (2) predict stances based on beliefs and demographic characteristics, and (3) determine the appeal of an argument to an individual based on their traits.
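To make task (1) concrete, here is a minimal Python sketch of how such a pairwise comparison could be posed to a model and scored against human judgments. The prompt wording, the `query_llm` placeholder, and the dataset field names are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of task (1): ask an LLM which of two arguments is more convincing
# and measure agreement with a human-majority label.
# `query_llm` is a placeholder for whatever chat/completions client is used;
# prompt wording and field names are assumptions, not the paper's protocol.

def build_pairwise_prompt(topic: str, argument_a: str, argument_b: str) -> str:
    return (
        f"Debate topic: {topic}\n\n"
        f"Argument A: {argument_a}\n\n"
        f"Argument B: {argument_b}\n\n"
        "Which argument is more convincing? Answer with exactly 'A' or 'B'."
    )

def query_llm(prompt: str) -> str:
    """Placeholder: call the chosen model (e.g., GPT-4 or Claude) and return its raw reply."""
    raise NotImplementedError

def evaluate_pairwise(examples: list[dict]) -> float:
    """examples: [{'topic': ..., 'arg_a': ..., 'arg_b': ..., 'human_label': 'A' or 'B'}, ...]"""
    correct = 0
    for ex in examples:
        reply = query_llm(build_pairwise_prompt(ex["topic"], ex["arg_a"], ex["arg_b"]))
        prediction = "A" if reply.strip().upper().startswith("A") else "B"
        correct += prediction == ex["human_label"]
    return correct / len(examples)  # agreement with the human-majority choice
```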
LLMs perform on par with humans on all three tasks.
The increasing accessibility and capability of state-of-the-art models such as GPT-4 and Claude could exacerbate this issue by enabling the creation of highly personalized misleading content with unprecedented ease (Bommasani et al., 2021; Goldstein et al., 2023).
Generation is easier than detection for LLMs.
Can LLMs...
RQ1: judge the quality of arguments and identify convincing ones as well as humans do?
RQ2: judge how demographics and beliefs influence people’s stances on specific topics?
RQ3: determine how arguments appeal to individuals depending on their demographics?
We examine to what extent LLMs can capture which debates were considered to have better arguments (RQ1) and to what extent arguments were effective across different demographics (RQ3).
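As an illustration of an RQ3-style probe, a short sketch of how one could ask a model whether a given argument appeals to a reader with a given profile. The profile fields and the yes/no framing are assumptions for illustration, not the paper's exact protocol.

```python
# Hypothetical RQ3 probe: given a reader profile and one argument, ask the
# model whether that argument would persuade that reader.
# Field names ('age', 'political_leaning', ...) are illustrative only.

def build_appeal_prompt(profile: dict, topic: str, argument: str) -> str:
    profile_lines = "\n".join(f"- {key}: {value}" for key, value in profile.items())
    return (
        f"Reader profile:\n{profile_lines}\n\n"
        f"Debate topic: {topic}\n\n"
        f"Argument: {argument}\n\n"
        "Would this argument be persuasive to this reader? Answer 'yes' or 'no'."
    )

example_profile = {"age": "34", "political_leaning": "moderate", "stance on topic": "undecided"}
print(build_appeal_prompt(example_profile, "School uniforms", "Uniforms reduce peer pressure..."))
```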
Research on microtargeting should move from "does microtargeting work?" to "when does microtargeting work?" At the same time, the increasing popularity and capabilities of large language models (LLMs) have raised concerns that they may make microtargeting cheaper and more effective, and that they may enable new ways to "microtarget" misinformation and propaganda, e.g., via personalized chatbots.
Our findings indicate that large language models demonstrate human-level performance in (1) judging argument quality, (2) predicting users' stances on specific topics given their demographics and basic beliefs, and (3) detecting arguments that would be persuasive to individuals with specific demographics or beliefs. However, overall human performance is not high on any of the three tasks (around 60% accuracy for (1), and around 40% for (2) and (3)), which could be due to the inherent difficulty of the tasks, as well as variance and randomness in the data.
One hypothesis that could explain the relatively low accuracy of both LLMs and humans is that these demographic questions and big-issue stances may not be highly relevant for the task.