Why do LLM descriptions of argument schemes work better than formal definitions for classification?
This explores why an LLM's own plain-language paraphrase of an argument scheme beats a formal Walton-style logical definition when you prompt it to sort arguments into categories.
The corpus gives a direct answer and then a deeper one. The direct finding is that paraphrases match the model's training distribution better than formal logical vocabulary (see "Why do paraphrased definitions work better than expert ones?"): the model has simply seen far more text that talks about reasoning in ordinary words than text written in the compressed notation of argumentation theory. A definition isn't 'better' just because it's more rigorous; it's better only if it lands in territory the model has actually traveled.
Why that should be true gets sharper when you look at what LLMs are doing under the hood. They behave as semantic rather than symbolic reasoners (see "Do large language models reason symbolically or semantically?"), leaning on token associations and commonsense rather than manipulating formal symbols. Strip the everyday semantics out of a task and performance collapses even when the correct rules are sitting right there in the prompt. A formal Walton definition does exactly that stripping: it replaces familiar phrasing with logical scaffolding the model can't reason over symbolically, so it loses the very handhold it relies on. The same outcome is predictable from the view of LLMs as autoregressive probability machines (see "Can we predict where language models will fail?"): an autoregressive model finds low-probability phrasings systematically harder, and formal definitions are exactly the rare, low-probability register that trips it up.
There's a second layer worth knowing: even with good descriptions, this task is just hard. Argument scheme classification stumbles where other NLP tasks succeed (see "Why does argument scheme classification stumble where other NLP tasks succeed?") because it requires spotting an inferential pattern spread across a whole passage, not a local surface cue. That is why models plateau around F1 0.55–0.65 here while clearing 0.80 on component tagging and stance. And it only works at all with few-shot examples plus descriptions (see "Can large language models classify argument schemes reliably?"); zero-shot fails across the board. So the description isn't a minor prompt-tuning trick. It's load-bearing, because it carries the model over a representational gap that formal definitions widen rather than close.
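The few-shot-plus-descriptions setup can be sketched in a few lines of Python. This is a minimal illustration, not the prompt or metric the underlying studies used: the scheme labels, the paraphrased descriptions, the example arguments, and the helper names `build_prompt` and `macro_f1` are all hypothetical, and macro-averaging is an assumption, since the notes report F1 without saying how it was averaged.

```python
# Hypothetical plain-language descriptions (illustrative only, not from the studies).
SCHEME_DESCRIPTIONS = {
    "expert_opinion": "The speaker says a claim is true because someone "
                      "with relevant expertise said so.",
    "consequences": "The speaker argues for or against an action by "
                    "pointing to the good or bad results it would lead to.",
    "analogy": "The speaker says two situations are alike, so what holds "
               "for one should hold for the other.",
}

# Hypothetical few-shot examples: (argument text, scheme label).
FEW_SHOT = [
    ("Dr. Lee, a cardiologist, says statins cut heart attack risk, "
     "so they do.", "expert_opinion"),
    ("If we ban cars downtown, shops will lose customers, "
     "so we should not ban them.", "consequences"),
]


def build_prompt(argument, descriptions=SCHEME_DESCRIPTIONS, examples=FEW_SHOT):
    """Assemble a few-shot classification prompt with scheme descriptions."""
    lines = ["Classify the argument into one of these schemes:", ""]
    for label, desc in descriptions.items():
        lines.append(f"- {label}: {desc}")
    lines.append("")
    for text, label in examples:
        lines.append(f"Argument: {text}")
        lines.append(f"Scheme: {label}")
        lines.append("")
    # The final slot is left open for the model to complete.
    lines.append(f"Argument: {argument}")
    lines.append("Scheme:")
    return "\n".join(lines)


def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over every label seen in either list."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Swapping the values in `SCHEME_DESCRIPTIONS` for formal definitions while holding the examples fixed would give the kind of paraphrase-versus-formal comparison described above; passing an empty `examples` list gives the zero-shot condition.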
The quietly surprising takeaway: the thing that makes a definition good for a logician (precision, abstraction, formal vocabulary) is exactly what makes it bad for an LLM. These models pattern-match on surface form, not deep structure. That is the same reason their grammatical competence degrades with structural complexity (see "Why do large language models fail at complex linguistic tasks?"), and the same disconnect behind cases where a model can recite a concept's definition yet fail to apply it (see "Can LLMs understand concepts they cannot apply?"). A paraphrase works not because it's clearer to a human, but because it speaks in the statistical dialect the model already fluently inhabits.
Sources (7 notes)
LLM-generated descriptions of argument schemes yield better classification performance than expert Walton definitions. The advantage stems from paraphrases matching the model's training distribution better than formal logical vocabulary.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.
Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.