Argument Quality Assessment in the Age of Instruction-Following Large Language Models
Rather than merely fine-tuning LLMs to chase leaderboards on assessment tasks, they need to be instructed systematically with argumentation theories and scenarios, as well as with ways to solve argument-related problems.
When learning about controversial issues, people rarely accept arguments they encounter without further contemplation. Rather, they seek to find the best arguments: those that help them form an opinion or write texts that persuade others; those that make them reach agreement or at least understand each other better. That is to say, argument quality is of interest as soon as arguments are presented to an audience. Computational argumentation aids the treatment of arguments at a larger scale, with important applications in search (Wachsmuth et al., 2017c), business (Slonim et al., 2021), and education (Wambsganss and Niklaus, 2022). But the situation there is the same: It is not enough to mine or generate arguments; their quality also needs to be evaluable (Park and Cardie, 2018), so that it can be assessed (Lauscher et al., 2020), and flaws can be found (Goffredo et al., 2022) and accounted for (Skitalinskaya et al., 2023).
Two inherent challenges of argument quality were visible early on: the diversity of quality notions, as well as the subjectivity of their perception and, hence, of their assessment, for both humans and computational models.
In particular, we are convinced that the capabilities of instruction-following LLMs enable research to overcome many aspects of the two challenges. To this end, the primary focus of NLP research on argument quality should be put on systematic ways to teach LLMs to follow instructions, covering concepts and settings of arguing as well as ways to solve argument-related problems (Section 3). Rather than fine-tuning LLMs on predefined domains (manifested in the training data) and preselected theories (manifested in the data’s annotations), or relying on simple prompt engineering, we expect the greatest impact to lie in teaching LLMs the theories, circumstances, and ethical constraints to adhere to. The rationale behind this is that LLMs will often already possess the relevant knowledge, but not know what to prioritize in a given setting.
While Argyle et al. (2023) developed LLMs that tone down argumentative conversations, we postulate a contrary path: exploiting the capabilities of LLMs to proactively enable people to learn about and better reason about controversial issues, thus contributing to more deliberate conversations (Vecchi et al., 2021).
Conceptually, argument quality assessment is a classification or regression problem, even if partly treated as preference learning. For decades, research in NLP relied on the traditional supervised learning paradigm for most such problems: to induce a mapping from one representational space to another using training pairs of input and output. As sketched in Figure 2a, the input spaces (usually, representing natural language) and the output spaces (label schemes or value ranges, such as argument quality scores) are thereby kept separate. This separation prevents any exchange of knowledge between the two spaces.
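The separation of spaces can be made concrete with a minimal sketch of the traditional paradigm: a nearest-neighbor regressor over bag-of-words features, where texts live in the input space and quality scores in the output space, connected only by training pairs. All texts and scores below are invented for illustration.

```python
import math
import re
from collections import Counter

# Toy training pairs: argument texts (input space) mapped to
# quality scores (output space). Values are invented for illustration.
TRAIN = [
    ("School uniforms reduce peer pressure and cost less over time.", 0.8),
    ("Uniforms are bad because they are bad.", 0.2),
    ("Banning uniforms respects students' freedom of expression.", 0.7),
]

def bag_of_words(text):
    """Represent a text in a simple input space: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict_quality(text):
    """Map from the input space to the output space via the most
    similar training example; no knowledge flows between the spaces."""
    vec = bag_of_words(text)
    best = max(TRAIN, key=lambda pair: cosine(vec, bag_of_words(pair[0])))
    return best[1]

print(predict_quality("Uniforms are bad."))  # score of the nearest neighbor
```

The model can only ever emit scores it has seen paired with inputs; nothing about what the scores mean is available to it, which is exactly the limitation discussed next.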
We can expect that most knowledge about quality notions and their interdependencies, as well as about influence factors and their effect on the subjective perception of argument quality, has already been processed by leading LLMs, such as GPT-4.
By default, however, LLMs do not necessarily have access to what is to be prioritized in the setting of the task at hand. We expect that this is the information an LLM needs to be instructed with to assess argument quality reliably.
The information to instruct an LLM with includes, but is not limited to:
• arguing goals, from agreement (El Baff et al., 2018) to deliberation (Falk et al., 2021);
• definitions of various quality notions, whether for maximal quality (Lauscher et al., 2020) or minimal quality (Jin et al., 2022);
• specificities of audiences (Lukin et al., 2017) and debaters (Durmus and Cardie, 2019);
• background on controversial topics, such as other arguments (Luu et al., 2019) or relationships between the topics (Zhao et al., 2021);
• ethical aspects, such as biases (Holtermann et al., 2022) and culture (Chen et al., 2022a);
• any examples of respective assessments, following the common few-shot learning idea.
Instructions may overcome all these challenges. An example is: “Rate the claim’s quality from the perspective of deliberation, when presented to a person of low literacy.” This makes the semantics of the task much more explicit, likely reducing biases and spurious correlations. It performs a stage-setting and an addressee-setting function, both of which are crucial for assessment and improvement.
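Such instructions can be assembled programmatically from the components listed above. The following sketch is hypothetical (the function name, wording templates, and rating labels are our own), but it shows how an arguing goal, a quality notion, an audience, and few-shot examples combine into one explicit instruction:

```python
def build_assessment_prompt(argument, goal, quality_notion, audience,
                            examples=()):
    """Assemble an instruction that makes the task semantics explicit:
    arguing goal, quality notion, and audience (stage- and
    addressee-setting), plus optional few-shot example assessments."""
    lines = [
        f"Rate the following argument's {quality_notion} "
        f"from the perspective of {goal}, "
        f"when presented to {audience}."
    ]
    for text, rating in examples:  # few-shot demonstrations
        lines.append(f'Argument: "{text}"\nRating: {rating}')
    lines.append(f'Argument: "{argument}"\nRating:')
    return "\n\n".join(lines)

prompt = build_assessment_prompt(
    argument="School uniforms reduce peer pressure.",
    goal="deliberation",
    quality_notion="quality",
    audience="a person of low literacy",
    examples=[("Uniforms are bad because they are bad.", "low")],
)
print(prompt)
```

Each argument of the builder corresponds to one bullet above, so varying a single parameter changes the stage- or addressee-setting without touching the rest of the prompt.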
Argument quality assessment may be brought to the next level through systematic approaches to task-related instruction fine-tuning.
The idea is to bring specific knowledge about theories, circumstances, and ethical constraints of arguing, along with ways to solve argument-related problems, into the fine-tuning process.
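A minimal sketch of how such fine-tuning data might be constructed, assuming a common (instruction, input, output) JSONL format for instruction tuning; all field contents and the helper name are hypothetical:

```python
import json

def make_record(theory, circumstances, constraints, argument, assessment):
    """Build one fine-tuning record that pairs an argument-related
    problem with the theory, circumstances, and constraints to adhere to."""
    instruction = (
        f"Assess the argument's quality according to {theory}. "
        f"Setting: {circumstances}. "
        f"Constraints: {constraints}."
    )
    return {"instruction": instruction, "input": argument,
            "output": assessment}

records = [
    make_record(
        theory="the rhetoric dimension (effectiveness)",
        circumstances="a public debate on school uniforms",
        constraints="do not reward ad-hominem attacks",
        argument="Uniforms reduce peer pressure among students.",
        assessment="effective: appeals to a shared concern of the audience",
    ),
]

# One JSON object per line, as expected by typical fine-tuning pipelines.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

Because the theory, setting, and constraints are explicit fields rather than implicit annotation choices, the same pipeline can generate training data for different quality notions and audiences systematically.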