Argument Summarization and its Evaluation in the Era of Large Language Models

Paper · arXiv 2503.00847 · Published March 2, 2025

Large Language Models (LLMs) have revolutionized various Natural Language Generation (NLG) tasks, including Argument Summarization (ArgSum), a key subfield of Argument Mining (AM). This paper investigates the integration of state-of-the-art LLMs into ArgSum, including its evaluation. In particular, we propose a novel prompt-based evaluation scheme and validate it through a novel human benchmark dataset. Our work makes three main contributions: (i) the integration of LLMs into existing ArgSum frameworks, (ii) the development of a new LLM-based ArgSum system, benchmarked against prior methods, and (iii) the introduction of an advanced LLM-based evaluation scheme. We demonstrate that the use of LLMs substantially improves both the generation and evaluation of argument summaries, achieving state-of-the-art results and advancing the field of ArgSum.

Automatic Text Summarization (ATS) aims to condense the key ideas from one or more documents into a concise summary (Radev et al., 2002), while minimizing redundancy (Moratanch and Chitrakala, 2017). Abstractive summarization generates a summary from text units that do not necessarily appear in the source text, whereas extractive summarization identifies the most important parts of a document and assembles them into a summary (Giarelis et al., 2023). ATS comprises several sub-areas, such as News Summarization (Sethi et al., 2017), Legal Document Summarization (Anand and Wagh, 2022), Scientific Paper Summarization (Zhang et al., 2018), and ArgSum (Bar-Haim et al., 2020a,b). Our focus is the latter.

Misra et al. (2016), Reimers et al. (2019), and Ajjour et al. (2019) treat the task of summarizing arguments as a clustering problem.

Wang and Ling (2016) frame ArgSum as claim generation, where a collection of argumentative sentences is summarized by generating a one-sentence abstractive summary that captures the shared opinion of the inputs. Schiller et al. (2021) present an aspect-controlled argument generation model that enables abstractive summarization of arguments.

Our understanding of ArgSum is in line with Key Point Analysis (KPA), introduced by Bar-Haim et al. (2020a,b), and is illustrated in Figure 1. KPA aims to create an extractive summary of the most prominent key points from a potentially large collection of arguments on a given debate topic and stance. Each source argument is then matched to the most suitable key point. Alshomary et al. (2021) perform the key point extraction by utilizing a variant of PageRank (Page et al., 1998). Li et al. (2023) extend KPA with a clustering-based and abstractive approach, using grouped arguments as input for a generation model to create key points.
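To make the matching step concrete, the following minimal sketch assigns each argument to its best-matching key point. Off-the-shelf sentence embeddings (all-MiniLM-L6-v2) stand in here for the trained match-scoring models used in the KPA literature; the function name and the threshold value are our own illustrative assumptions.

```python
# Illustrative sketch of the KPA matching step: assign each argument to its
# best-matching key point. Cosine similarity over generic sentence embeddings
# is a stand-in for a trained match-scoring model; threshold is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

def match_arguments_to_key_points(arguments, key_points, threshold=0.5):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    arg_emb = model.encode(arguments, normalize_embeddings=True)
    kp_emb = model.encode(key_points, normalize_embeddings=True)
    scores = arg_emb @ kp_emb.T  # cosine similarities, shape (n_args, n_kps)
    assignments = {}
    for i, arg in enumerate(arguments):
        j = int(np.argmax(scores[i]))
        # Arguments whose best score falls below the threshold stay unmatched.
        assignments[arg] = key_points[j] if scores[i, j] >= threshold else None
    return assignments
```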

Khosravani et al. (2024) introduce a clustering-based and extractive approach, selecting the most representative argument within each cluster as a key point, as determined by a supervised scoring model.
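The following is a minimal sketch of such a clustering-based, extractive pipeline, assuming precomputed argument embeddings; the length-based heuristic merely stands in for the supervised scoring model of Khosravani et al. (2024), and all names are our own.

```python
# Sketch of a clustering-based, extractive KPA pipeline: cluster the
# arguments, then pick the highest-scoring member of each cluster as its key
# point. The shortest-argument heuristic is a placeholder for a supervised
# quality scorer.
from sklearn.cluster import KMeans

def extractive_key_points(arg_embeddings, arguments, n_clusters=5):
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(arg_embeddings)
    key_points = []
    for c in range(n_clusters):
        members = [i for i, label in enumerate(labels) if label == c]
        # Placeholder scorer: prefer short, self-contained arguments.
        best = min(members, key=lambda i: len(arguments[i].split()))
        key_points.append(arguments[best])
    return key_points
```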

SMatchToPr To identify argument summary candidates, SMatchToPr (Alshomary et al., 2021) uses a variant of PageRank (Page et al., 1998). To this end, candidates are treated as nodes in an undirected graph, with the match scores between candidate pairs serving as edge weights. Only nodes whose match score exceeds a threshold t_n are connected. Based on the resulting graph, an importance score is calculated for each candidate. SMatchToPr then minimizes redundancy by removing candidates whose match score with a higher-ranked candidate exceeds a threshold t_m. This yields the final set of argument summaries.
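A minimal sketch of this selection procedure, as we read it, is shown below. The symmetric match-score matrix, the function name, and the default thresholds are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of SMatchToPr-style candidate selection: candidates form an
# undirected graph whose edges carry pairwise match scores above t_n,
# PageRank supplies importance, and near-duplicates of higher-ranked
# candidates (match score above t_m) are dropped. match_scores is assumed
# to be a symmetric matrix of pairwise match scores in [0, 1].
import networkx as nx

def select_summaries(candidates, match_scores, t_n=0.4, t_m=0.8, k=10):
    g = nx.Graph()
    g.add_nodes_from(range(len(candidates)))
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if match_scores[i][j] > t_n:
                g.add_edge(i, j, weight=match_scores[i][j])
    importance = nx.pagerank(g, weight="weight")
    ranked = sorted(importance, key=importance.get, reverse=True)
    selected = []
    for i in ranked:
        # Redundancy filter: skip candidates too similar to an earlier pick.
        if all(match_scores[i][j] <= t_m for j in selected):
            selected.append(i)
        if len(selected) == k:
            break
    return [candidates[i] for i in selected]
```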

In accordance with previous work on generating argument summaries, we assess the two evaluation dimensions of coverage and redundancy. Coverage refers to the extent to which a set of argument summaries captures the central talking points of a debate. Redundancy concerns the extent of content overlap between the individual argument summaries (Bar-Haim et al., 2020b; Alshomary et al., 2021; Friedman et al., 2021; Li et al., 2023; Khosravani et al., 2024). Both criteria must be considered jointly when assessing the overall quality of a set of argument summaries, since high coverage can trivially be achieved by generating many redundant argument summaries (Friedman et al., 2021).
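One simple way to operationalize the two dimensions, assuming a pairwise match function match_fn that returns scores in [0, 1], is sketched below; this is an illustrative simplification of our own, not the exact metrics used in prior work.

```python
# Illustrative metrics: coverage as the share of source arguments matched to
# at least one summary, redundancy as the mean pairwise match score among
# the summaries themselves. match_fn is an assumed black-box scorer.
from itertools import combinations

def coverage(arguments, summaries, match_fn, threshold=0.5):
    covered = sum(
        any(match_fn(arg, s) >= threshold for s in summaries)
        for arg in arguments
    )
    return covered / len(arguments)

def redundancy(summaries, match_fn):
    pairs = list(combinations(summaries, 2))
    if not pairs:  # a single summary cannot be redundant
        return 0.0
    return sum(match_fn(a, b) for a, b in pairs) / len(pairs)
```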

The LLM-based ArgSum evaluation scores we propose correlate strongly with human judgements and thus provide a reliable evaluation framework for settings in which reference summaries are available.