Why does argument scheme classification stumble where other NLP tasks succeed?
Explores whether the abstract, relational nature of argument schemes makes them harder to classify than concrete argument components or stance. Matters because understanding this difficulty gap could improve scheme recognition systems.
Argument-mining NLP tasks divide along a hidden axis of difficulty. Identifying argument components (claim, premise, warrant) is a span-tagging task — the unit is a piece of text, and the cues are positional and lexical. Identifying stance is a sentence-level classification task — the cues are sentiment and polarity. Identifying argument schemes in Walton's taxonomy is categorically harder because the unit of recognition is not a piece of text but a pattern of reasoning linking premises to a conclusion through a specific inferential move.
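The difference in recognition unit can be made concrete by writing down what one labeled example looks like for each task. The sketch below is illustrative only: the class names, field names, and the `evidence_segments` helper are assumptions introduced here, not a schema from any particular dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ComponentExample:
    """Span tagging: the label attaches to one stretch of text."""
    text: str
    span: Tuple[int, int]  # character offsets of the tagged unit
    label: str             # e.g. "claim" or "premise"


@dataclass
class StanceExample:
    """Sentence-level classification: one sentence, one label."""
    sentence: str
    label: str             # e.g. "pro" or "con"


@dataclass
class SchemeExample:
    """Scheme classification: the label describes how premises support the conclusion."""
    premises: List[str]
    conclusion: str
    label: str             # e.g. "argument from expert opinion"


Example = Union[ComponentExample, StanceExample, SchemeExample]


def evidence_segments(example: Example) -> int:
    """Count the text segments that must be read together to justify the label."""
    if isinstance(example, SchemeExample):
        return len(example.premises) + 1  # every premise plus the conclusion
    return 1  # a single span or sentence carries the label
```

For component and stance examples the helper returns 1; for a scheme example it grows with the number of premises, which is the structural difference the paragraph above describes.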
The empirical signature of this difficulty is a plateau around F1 0.55–0.65 across both pretrained language models and modern LLMs. BERT achieves F1 0.53; the strongest large model reaches 0.65 in the most favorable configuration. The same models that classify stance and tag argument components at well above 0.80 stall on schemes. This is not a scaling issue alone: it is evidence that scheme recognition requires integrating multiple text spans (premises and conclusion) and reasoning about the inferential bridge between them.
The cognitive-load framing predicts further failure modes. Tasks where the recognition target is a relation among text segments (rather than a property of a single segment) should consistently underperform tasks where recognition is local. Argument scheme classification is one instance; others include rhetorical relation classification in RST, discourse coherence relations, and counterfactual implication. The shared structure is that the evidence for the label is distributed across the input and requires integration.
The practical implication is that argument scheme labels are not yet a reliable feature for downstream pipelines. Systems that need scheme-aware behavior (dialectical evaluation, legal reasoning, value alignment dialogues) should either restrict themselves to a smaller set of schemes with the strongest classification performance or use the schemes' critical questions as a probing structure rather than relying on classification.
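One way to read "critical questions as a probing structure" is to skip the classification step and instead instantiate a scheme's critical questions against a concrete argument, then pass each question to a reviewer or a model. The sketch below is a minimal illustration under that reading, not the method of the underlying paper. The questions paraphrase Walton's commonly cited critical questions for argument from expert opinion; `Scheme`, `build_probes`, the `{expert}` and `{claim}` placeholders, and the sample bindings are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Scheme:
    """A scheme reduced to its name and its critical-question templates."""
    name: str
    critical_questions: List[str]


# Paraphrased critical questions for argument from expert opinion.
EXPERT_OPINION = Scheme(
    name="argument from expert opinion",
    critical_questions=[
        "How credible is {expert} as an expert source?",
        "Is {expert} an expert in the field that '{claim}' belongs to?",
        "What did {expert} assert that implies '{claim}'?",
        "Is {expert} personally reliable as a source?",
        "Is '{claim}' consistent with what other experts assert?",
        "Is {expert}'s assertion based on evidence?",
    ],
)


def build_probes(scheme: Scheme, bindings: Dict[str, str]) -> List[str]:
    """Fill each question template with the argument's concrete parts."""
    return [q.format(**bindings) for q in scheme.critical_questions]


if __name__ == "__main__":
    probes = build_probes(
        EXPERT_OPINION,
        {"expert": "Dr. Lee", "claim": "the bridge is safe"},
    )
    for probe in probes:
        print(probe)
```

The design choice is that nothing downstream depends on a predicted scheme label being correct: a wrongly chosen scheme produces questions that are irrelevant and easy to discard, whereas a wrong label used as a feature fails silently.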
Related concepts in this collection
- Can large language models classify argument schemes reliably?
Explores whether LLMs can recognize Walton's 60+ argument schemes—abstract patterns of reasoning rather than surface features—and what conditions enable accurate classification.
same paper, the empirical evaluation
- Why do reasoning models struggle with theory of mind tasks?
Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.
analogous: integrative reasoning tasks behave differently from local-pattern tasks
- Can structured argument prompts make LLM reasoning more rigorous?
Does requiring language models to explicitly check warrants, backing, and rebuttals—rather than reasoning freely—improve reasoning quality and catch failures that standard step-by-step prompting misses?
the workaround: use scheme structure to drive reasoning rather than as a classification target
Original note title
argument scheme classification carries higher cognitive load than other argument NLP tasks because schemes are abstract presumptive patterns not surface features