Does compressing Walton's schemes into nine categories make LLM classification easier?

This explores whether collapsing Walton's many fine-grained argument schemes into a smaller set of nine categories actually helps LLMs classify them. The corpus suggests the bottleneck isn't the number of categories so much as how the categories are described and how LLMs represent meaning.

The corpus doesn't run that exact experiment, but several of its findings point in a clear direction: fewer categories may help a little, but the real levers are elsewhere. The most direct evidence is that LLMs classify argument schemes acceptably only under narrow conditions (few-shot examples plus scheme descriptions). Even then, only larger models exceed F1 0.55, with Claude topping out around 0.65, while smaller models plateau near 0.53 (Can large language models classify argument schemes reliably?). That plateau looks like a representational ceiling, not a category-count problem, which hints that compressing the label set alone won't unlock much.

What *does* move the needle is surprising: LLM-generated paraphrases of the schemes outperform Walton's own formal definitions (Why do paraphrased definitions work better than expert ones?). The reason is that paraphrases sit closer to the model's training distribution than formal logical vocabulary does. So if you're going to compress to nine categories, the win comes less from *how few* the categories are and more from *how the categories are worded*: describe each in the model's native idiom rather than in expert logic-speak.
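The few-shot-plus-descriptions setup can be sketched as a small prompt builder. This is a minimal illustration, not the prompt used in the cited note: the names `Scheme`, `SCHEMES`, and `build_prompt` are hypothetical, and both the formal and the paraphrased wordings are abridged stand-ins written for this example rather than Walton's exact definitions or the paraphrases from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    formal: str      # expert-style wording (abridged, illustrative only)
    paraphrase: str  # plain-language wording (illustrative only)


# Two well-known Walton schemes used as placeholders; a real label set
# would list every category, each with its own description.
SCHEMES = [
    Scheme(
        name="Argument from Expert Opinion",
        formal="E is an expert in domain D; E asserts A; A is within D; "
               "therefore A may plausibly be taken to be true.",
        paraphrase="Someone who knows the field well says this is so, "
                   "so it is probably so.",
    ),
    Scheme(
        name="Argument from Analogy",
        formal="Case C1 is similar to case C2; A is true in C1; "
               "therefore A is true in C2.",
        paraphrase="This situation is like that one, and it held there, "
                   "so it should hold here too.",
    ),
]


def build_prompt(argument, examples, use_paraphrase=True):
    """Assemble a few-shot classification prompt.

    examples is a list of (argument_text, scheme_name) pairs.
    use_paraphrase selects which description style is shown to the model.
    """
    lines = ["Classify the argument into exactly one of these schemes.", ""]
    for scheme in SCHEMES:
        desc = scheme.paraphrase if use_paraphrase else scheme.formal
        lines.append(f"- {scheme.name}: {desc}")
    lines.append("")
    for text, label in examples:
        lines.append(f"Argument: {text}")
        lines.append(f"Scheme: {label}")
        lines.append("")
    lines.append(f"Argument: {argument}")
    lines.append("Scheme:")
    return "\n".join(lines)
```

Flipping `use_paraphrase` while holding the examples and label set fixed isolates the wording effect from the category-count effect, which is the comparison the paragraph above cares about.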

There's a deeper reason compression is double-edged. LLMs already compress concepts far more aggressively than humans do, capturing broad category structure while shedding the fine-grained distinctions humans preserve for situated meaning (Do LLMs compress concepts more aggressively than humans do?). Folding Walton's schemes into nine categories plays *to* that tendency: coarser buckets match what the model naturally retains. But it cuts the other way too. If the nine categories still require distinguishing arguments by their underlying logical form, the model may struggle, because it reasons through semantic association rather than symbolic structure. When meaning is stripped away and only the logical skeleton remains, LLM reasoning collapses even with the correct rules in hand (Do large language models reason symbolically or semantically?).

That's the catch worth knowing: argument schemes are partly *formal* objects, and LLMs are not formal reasoners. A model can even produce a flawless explanation of a scheme and then fail to apply it to an actual argument, a disconnect between knowing and doing that doesn't look like a human knowledge gap (Can LLMs understand concepts they cannot apply?). So compressing to nine categories likely helps most if those categories are semantically distinct (different topics, different vocabulary) and helps least if they're formally distinct but semantically overlapping. The configuration the corpus would bet on is fewer, well-paraphrased, semantically separable categories, not nine for nine's sake.
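One way to see why compression only helps when confusions fall *inside* a coarse bucket is to rescore the same predictions under a fine and a collapsed label set. The sketch below is a toy illustration under stated assumptions: the averaging is assumed to be macro-F1 (the notes do not say which variant they report), the `COARSE` mapping and its bucket names are hypothetical, and the gold and predicted labels are invented data, not results from the corpus.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, where F1 = 2TP / (2TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def collapse(labels, mapping):
    """Map fine-grained labels onto coarse categories."""
    return [mapping[label] for label in labels]


# Hypothetical fine-to-coarse mapping; the bucket names are made up here.
COARSE = {
    "expert_opinion": "source_based",
    "position_to_know": "source_based",
    "analogy": "comparison",
    "precedent": "comparison",
}

# Invented toy data: both mistakes confuse schemes within the same bucket.
gold = ["expert_opinion", "position_to_know", "analogy", "precedent"]
pred = ["position_to_know", "position_to_know", "analogy", "analogy"]

fine_score = macro_f1(gold, pred)
coarse_score = macro_f1(collapse(gold, COARSE), collapse(pred, COARSE))
```

In this toy case the collapsed score is higher because every error stays within a bucket. Had the model confused `expert_opinion` with `analogy` instead, collapsing would not have removed the error, which is the situation the paragraph above warns about when categories overlap semantically.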

Sources 5 notes

Can large language models classify argument schemes reliably?

Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.

Why do paraphrased definitions work better than expert ones?

LLM-generated descriptions of argument schemes yield better classification performance than expert Walton definitions. The advantage stems from paraphrases matching the model's training distribution better than formal logical vocabulary.

Do LLMs compress concepts more aggressively than humans do?

Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
