Why does scheme classification require more cognitive load than identifying premises?
This explores why getting a model (or a person) to name *which kind of argument* is being made is harder than spotting the pieces an argument is built from — and the corpus suggests it's because schemes live in the relationships between scattered parts, not in any single part you can point to.
Scheme classification, naming the inferential pattern an argument follows (appeal to expert, cause-to-effect, analogy), is harder than identifying premises, the raw claims an argument is built from. The short version from the corpus: premises are *local*, schemes are *relational*. Identifying a premise means tagging a span of text where it sits; classifying a scheme means recognizing how spans relate to each other across a whole argument. The note "Why does argument scheme classification stumble where other NLP tasks succeed?" makes this concrete: the same systems that exceed F1 0.80 on tagging components and detecting stance plateau at 0.55–0.65 on schemes, because the work is not reading surface features but integrating an inferential pattern distributed across the text.
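One way to picture the local/relational contrast is to compare the shape of the two annotations. The sketch below is illustrative only: the class names, the `spans_involved` helper, and the example text are assumptions made for this page, not a format taken from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A contiguous stretch of text, addressed by character offsets."""
    start: int
    end: int


@dataclass(frozen=True)
class ComponentLabel:
    """Local annotation: one label attached to exactly one span."""
    span: Span
    role: str  # e.g. "premise" or "conclusion"


@dataclass(frozen=True)
class SchemeLabel:
    """Relational annotation: one label attached to a configuration of spans."""
    premises: tuple[Span, ...]
    conclusion: Span
    scheme: str  # e.g. "appeal_to_expert"


# Hypothetical three-sentence argument, invented for illustration.
text = (
    "Dr. Lee is a cardiologist. "
    "She says statins lower heart attack risk. "
    "So statins probably lower heart attack risk."
)


def find(sentence: str) -> Span:
    """Locate a sentence in `text` and return its character span."""
    start = text.index(sentence)
    return Span(start, start + len(sentence))


s1 = find("Dr. Lee is a cardiologist.")
s2 = find("She says statins lower heart attack risk.")
s3 = find("So statins probably lower heart attack risk.")

# Component tagging: each decision concerns a single span.
components = [
    ComponentLabel(s1, "premise"),
    ComponentLabel(s2, "premise"),
    ComponentLabel(s3, "conclusion"),
]

# Scheme classification: a single decision that depends on all three spans.
scheme = SchemeLabel(premises=(s1, s2), conclusion=s3, scheme="appeal_to_expert")


def spans_involved(label: SchemeLabel) -> int:
    """Count how many spans one scheme label depends on."""
    return len(label.premises) + 1
```

No single sentence here carries the "appeal to expert" label on its own; it only appears once the credential, the reported claim, and the conclusion are read together.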
That gap looks less like a missing fact and more like a representational ceiling. "Can large language models classify argument schemes reliably?" finds that zero-shot prompting fails uniformly. Few-shot prompting with scheme descriptions helps, but only larger models clear F1 0.55 (Claude reaches 0.65), while smaller ones stall around 0.53. The interesting wrinkle is *how* you describe the scheme: "Why do paraphrased definitions work better than expert ones?" shows that plain, LLM-generated paraphrases beat the formal expert (Walton) definitions, because paraphrases sit closer to what the model saw in training. So part of the 'cognitive load' isn't the reasoning itself; it's that the formal vocabulary of scheme theory is off-distribution. The model isn't reasoning its way to the category; it's pattern-matching, and you can lower the load by speaking its dialect.
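A minimal sketch of what "few-shot with paraphrased descriptions" could look like as a prompt. The scheme labels come from the examples named above; the paraphrases, the labeled examples, and the `build_prompt` helper are all hypothetical, written for this page rather than drawn from the studies. No model is called.

```python
# Hypothetical plain-language paraphrases, invented for this sketch.
SCHEME_DESCRIPTIONS = {
    "appeal_to_expert": "Someone who knows the field well says it is true, so it is probably true.",
    "cause_to_effect": "One thing tends to bring about another, so if the first happens, expect the second.",
    "analogy": "Two cases are alike in the ways that matter, so what holds for one likely holds for the other.",
}

# Hypothetical labeled examples, also invented for illustration.
FEW_SHOT_EXAMPLES = [
    ("My dentist says flossing prevents gum disease, so it probably does.", "appeal_to_expert"),
    ("Heavy rain saturates the soil, so expect flooding after this storm.", "cause_to_effect"),
    ("Seat belts saved lives in cars, and buses are similar, so belts should help there too.", "analogy"),
]


def build_prompt(argument: str, descriptions: dict[str, str], examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot prompt: scheme descriptions, labeled examples, then the query."""
    lines = ["Classify the argument into one of these schemes:", ""]
    for label, description in descriptions.items():
        lines.append(f"- {label}: {description}")
    lines.append("")
    for text, label in examples:
        lines.append(f"Argument: {text}")
        lines.append(f"Scheme: {label}")
        lines.append("")
    # The query is left open so the model completes the final label.
    lines.append(f"Argument: {argument}")
    lines.append("Scheme:")
    return "\n".join(lines)


prompt = build_prompt(
    "A leading climatologist says sea levels are rising, so they probably are.",
    SCHEME_DESCRIPTIONS,
    FEW_SHOT_EXAMPLES,
)
```

Swapping the values in `SCHEME_DESCRIPTIONS` for formal definitions while holding everything else fixed is the kind of comparison the paraphrase finding describes.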
Widen the lens and this is one instance of a recurring story: these systems handle local structure well and integrative structure badly. "Why do large language models fail at complex linguistic tasks?" documents the same shape in pure syntax: top models such as Llama3-70b misidentify embedded clauses, verb phrases, and complex nominals, and the errors worsen predictably as syntactic depth increases. Schemes are the argument-level analogue of an embedded clause: the meaning is in the nesting, not the words. "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" sharpens why that matters: chain-of-thought reproduces familiar reasoning *forms* rather than performing novel abstract inference, so when a task demands genuine structural recognition rather than recall, performance degrades predictably under distribution shift, the signature of imitation.
The thing you didn't know you wanted to know: the difficulty may not be 'reasoning is hard' so much as 'familiarity is everything.' "Do language models fail at reasoning due to complexity or novelty?" argues that reasoning failures track instance *novelty*, not task complexity: models fit patterns of specific instances rather than general algorithms. Read that against scheme classification and the plateau looks like a coverage problem: schemes form a large, fine-grained taxonomy in which any given pattern is comparatively rare in training, so the model has thin instance-level familiarity to lean on. The same logic would explain why descriptions and few-shot examples help where zero-shot fails: they are not teaching the reasoning, they are supplying the missing instances. If you want to push further on whether a bigger context or inference budget could close gaps like this, "Does reasoning ability actually degrade with longer inputs?" is a useful caution: reasoning accuracy degrades as inputs get longer, well below context-window capacity, and chain-of-thought prompting does not rescue it, suggesting the bottleneck is structural recognition, not budget.
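The length caution can be probed with a simple padding harness: hold the argument fixed, surround it with irrelevant filler, and watch accuracy as the input grows. This is a rough sketch under stated assumptions, not the FLenQA procedure: `pad_input` and `accuracy_by_length` are hypothetical helpers, `classify` stands in for whatever model call you use, and whitespace-separated words are only a crude proxy for tokens.

```python
from typing import Callable


def pad_input(argument: str, filler_sentences: list[str], target_words: int) -> str:
    """Surround the argument with irrelevant filler until the text reaches
    roughly `target_words` words (a crude stand-in for a token count)."""
    padding: list[str] = []
    count = len(argument.split())
    i = 0
    while count < target_words and filler_sentences:
        sentence = filler_sentences[i % len(filler_sentences)]
        padding.append(sentence)
        count += len(sentence.split())
        i += 1
    # Place the argument in the middle so filler sits on both sides.
    half = len(padding) // 2
    return " ".join(padding[:half] + [argument] + padding[half:])


def accuracy_by_length(
    classify: Callable[[str], str],
    examples: list[tuple[str, str]],
    filler_sentences: list[str],
    lengths: list[int],
) -> dict[int, float]:
    """Score the same labeled arguments at several padded lengths."""
    results: dict[int, float] = {}
    for n in lengths:
        correct = sum(
            classify(pad_input(argument, filler_sentences, n)) == gold
            for argument, gold in examples
        )
        results[n] = correct / len(examples)
    return results
```

If accuracy falls as `lengths` grows while the argument itself never changes, the extra text is hurting recognition rather than exhausting any budget, which is the pattern the note reports.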
Sources (7 notes)
"Why does argument scheme classification stumble where other NLP tasks succeed?" Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.
"Can large language models classify argument schemes reliably?" Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.
"Why do paraphrased definitions work better than expert ones?" LLM-generated descriptions of argument schemes yield better classification performance than expert Walton definitions. The advantage stems from paraphrases matching the model's training distribution better than formal logical vocabulary.
"Why do large language models fail at complex linguistic tasks?" Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
"Does chain-of-thought reasoning reveal genuine inference or pattern matching?" CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts, which is the signature of imitation rather than capability emergence.
"Do language models fail at reasoning due to complexity or novelty?" Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
"Does reasoning ability actually degrade with longer inputs?" FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.