Can Large Language Models Understand Argument Schemes?


Argument schemes represent stereotypical patterns of reasoning that occur in everyday arguments. However, despite their usefulness, argument scheme classification — that is, classifying natural language arguments according to the schemes they are instances of — is an under-explored task in NLP. In this paper we present a systematic evaluation of large language models (LLMs) for classifying argument schemes based on Walton's taxonomy. We experiment with seven LLMs in zero-shot, few-shot, and chain-of-thought prompting, and explore two strategies to enhance task instructions: employing formal definitions and LLM-generated descriptions. Our analysis on both manually annotated and automatically generated arguments, including enthymemes, indicates that while larger models exhibit satisfactory performance in identifying argument schemes, challenges remain for smaller models.

Argument schemes provide structured templates capturing stereotypical forms of arguments consisting of inferences from premise(s) to a conclusion. These schemes have traditionally been used in formal logic-based argumentation to support reasoning and deliberation, with particular emphasis on supporting dialogues for value alignment. Classifying argument schemes can enhance decision-making in AI systems, particularly in domains requiring complex reasoning, such as legal and medical applications. Incorporating argument schemes into decision-making processes promotes transparency and explainability, which are essential in high-stakes applications such as healthcare, finance, and governance.

In this paper, we focus on the taxonomy developed by Walton et al. (2008), the most prevalent taxonomy of argument schemes. It proposes over 60 argument schemes, each of which relates premises to a conclusion and has associated critical questions that identify how to challenge arguments that are instances of the scheme. However, classifying argument schemes is a particularly challenging task. Indeed, the cognitive load and resources associated with this task are higher than for other tasks, such as identifying distinct components of arguments and their stance towards a topic or other arguments.
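To make the structure of a scheme concrete, the following is a minimal Python sketch of how one scheme could be represented: a name, premises, a conclusion, and critical questions. The class `ArgumentScheme`, its field names, and the `render` helper are hypothetical and not taken from the paper. The example instance paraphrases the widely cited Argument from Expert Opinion scheme; the wording is illustrative rather than a quotation of the taxonomy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArgumentScheme:
    """Hypothetical container for one argument scheme (not the paper's format)."""

    name: str
    premises: Tuple[str, ...]
    conclusion: str
    critical_questions: Tuple[str, ...] = ()

    def render(self) -> str:
        """Return a plain-text description, one component per line."""
        lines = [f"Scheme: {self.name}"]
        lines += [f"Premise {i}: {p}" for i, p in enumerate(self.premises, 1)]
        lines.append(f"Conclusion: {self.conclusion}")
        lines += [f"CQ{i}: {q}" for i, q in enumerate(self.critical_questions, 1)]
        return "\n".join(lines)


# Illustrative paraphrase of Argument from Expert Opinion.
EXPERT_OPINION = ArgumentScheme(
    name="Argument from Expert Opinion",
    premises=(
        "Source E is an expert in subject domain S containing proposition A.",
        "E asserts that proposition A is true.",
    ),
    conclusion="A may plausibly be taken to be true.",
    critical_questions=(
        "How credible is E as an expert source?",
        "Is E an expert in the field that A is in?",
        "Is A consistent with what other experts assert?",
    ),
)

print(EXPERT_OPINION.render())
```

A text rendering like this is one plausible way to supply a scheme's formal definition to a model as part of the task instructions.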

Our comprehensive evaluation covered zero-shot, few-shot, and chain-of-thought prompting methods across seven open-source and proprietary models on manually annotated human-made arguments, including enthymemes, as well as automatically generated arguments. Furthermore, we explored two approaches for enhancing the task instructions: normative information (i.e. the formal definitions of argument schemes as per Walton's taxonomy) and LLM-generated descriptions of argument schemes. Our analysis revealed that larger models can identify argument schemes satisfactorily in few-shot settings when given descriptions of argument schemes, whereas smaller models struggle more. LLMs also generally perform better than pre-trained language models (PLMs). BERT achieves an F1 of 0.53, with other PLMs not exceeding 0.55. In contrast, in the few-shot setting, nearly all larger LLMs achieve F1 scores above 0.55. The best PLM reported on EthiX, ERNIE, yields an F1 of 0.63, while the top-performing LLM, Claude, surpasses this with an F1 score of 0.65.
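The three prompting settings and the optional scheme descriptions can be illustrated with a small prompt builder. This is only a sketch under stated assumptions: the function `build_prompt`, its parameters, and the instruction wording are hypothetical and do not reproduce the prompts used in the paper. No model is called; the function only assembles text.

```python
from typing import Dict, List, Optional, Tuple


def build_prompt(
    argument: str,
    scheme_names: List[str],
    mode: str = "zero-shot",
    descriptions: Optional[Dict[str, str]] = None,
    examples: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Assemble a classification prompt (hypothetical wording, for illustration)."""
    if mode not in {"zero-shot", "few-shot", "cot"}:
        raise ValueError(f"unknown mode: {mode}")

    parts = ["Classify the argument below into exactly one of these argument schemes:"]
    for name in scheme_names:
        # Optionally attach a formal definition or a generated description.
        desc = (descriptions or {}).get(name)
        parts.append(f"- {name}: {desc}" if desc else f"- {name}")

    if mode == "few-shot":
        # Labeled demonstrations come before the target argument.
        for text, label in examples or []:
            parts.append(f"Argument: {text}\nScheme: {label}")

    parts.append(f"Argument: {argument}")
    if mode == "cot":
        parts.append(
            "Reason step by step about the premises and conclusion, "
            "then end with a line of the form 'Scheme: <name>'."
        )
    else:
        parts.append("Scheme:")
    return "\n".join(parts)


prompt = build_prompt(
    "Dr. Lee, a cardiologist, says statins lower risk, so they do.",
    ["Argument from Expert Opinion", "Argument from Analogy"],
    mode="few-shot",
    descriptions={"Argument from Expert Opinion": "An expert asserts the claim."},
    examples=[("My doctor says rest helps, so it does.", "Argument from Expert Opinion")],
)
print(prompt)
```

Keeping the scheme list, descriptions, and demonstrations as separate inputs makes it straightforward to compare settings with and without the enhanced instructions.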
