Why do smaller LLMs fail at zero-shot argument scheme classification?
This explores why small models stumble specifically at zero-shot argument scheme classification — the task of naming the underlying reasoning pattern of an argument (e.g. argument from cause, from expert opinion) without any examples — and the corpus suggests it's less about size alone than about a kind of reasoning these models do badly at any scale.
The most useful thing the corpus offers is to break the failure into two separate causes that get blurred together. The first is task difficulty, which hits models of all sizes; the second is a capacity threshold that smaller models simply fall below. On the difficulty side, scheme classification is unusual: unlike tagging argument components or detecting stance, where the same systems sail past F1 0.80, naming a scheme means recognizing an inferential pattern spread across a whole passage, not a local surface cue (see "Why does argument scheme classification stumble where other NLP tasks succeed?"). That is exactly the kind of integrative reasoning LLMs are weakest at: when a task depends on structure rather than familiar token associations, performance collapses because these models reason through semantic association, not symbolic manipulation of rules (see "Do large language models reason symbolically or semantically?").
Zero-shot makes this worse in a way that has nothing to do with the scheme itself. With no examples and no descriptions, the model has to map an abstract, formal label onto raw text, and the corpus shows that zero-shot prompting fails uniformly across every model tested, large and small (see "Can large language models classify argument schemes reliably?"). Part of the reason is vocabulary: formal Walton-style scheme definitions sit outside the model's training distribution, and simply paraphrasing those definitions into plainer language measurably improves classification (see "Why do paraphrased definitions work better than expert ones?"). So in the zero-shot setting the model is being asked to do its hardest kind of reasoning, with no examples, using vocabulary it handles poorly. The wonder is less that small models fail than that anything succeeds.
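To make the difference between the prompting conditions concrete, here is a minimal sketch of how the same argument could be presented zero-shot (labels only) versus with plain-language descriptions and a few-shot example. The function name `build_prompt`, the prompt wording, and the two paraphrased descriptions are all hypothetical illustrations, not the prompts used in the cited notes.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical plain-language paraphrases of two schemes (illustrative only).
SCHEMES: Dict[str, str] = {
    "argument from cause": "One event is said to bring about another.",
    "argument from expert opinion": "A claim is supported by what a knowledgeable source says.",
}


def build_prompt(
    argument: str,
    descriptions: Optional[Dict[str, str]] = None,
    examples: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Build a classification prompt.

    With no descriptions and no examples this is the zero-shot condition:
    the model sees only bare scheme labels and the raw argument text.
    """
    lines = ["Classify the argument into exactly one of these schemes:"]
    for label in SCHEMES:
        if descriptions:
            # Attach the paraphrased description to each label.
            lines.append(f"- {label}: {descriptions[label]}")
        else:
            lines.append(f"- {label}")
    # Few-shot demonstrations, if any, come before the target argument.
    for text, label in examples or []:
        lines.append(f"Argument: {text}")
        lines.append(f"Scheme: {label}")
    lines.append(f"Argument: {argument}")
    lines.append("Scheme:")
    return "\n".join(lines)


target = "Dr. Lee, a cardiologist, says the drug is safe, so it is safe."
zero_shot = build_prompt(target)
few_shot = build_prompt(
    target,
    descriptions=SCHEMES,
    examples=[("Heavy rain fell all week, so the river flooded.", "argument from cause")],
)
```

The zero-shot prompt carries only the label strings, which is the condition the notes describe as failing across all models; the second prompt adds the two kinds of help (descriptions and examples) that the notes say open a gap between larger and smaller models.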
Model size enters as a floor. Once you add few-shot examples and descriptions, a gap opens: larger models climb past F1 0.55 (Claude reaching 0.65), while smaller models plateau around 0.53, which suggests a representational-capacity threshold the small models can't cross even with help (see "Can large language models classify argument schemes reliably?"). This echoes a broader pattern: LLMs make systematic linguistic errors that worsen predictably as syntactic and structural complexity rises, because statistical learning captures surface patterns but not deep grammatical structure (see "Why do large language models fail at complex linguistic tasks?"). Smaller models have less of whatever representational headroom lets larger ones partially compensate.
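For readers unfamiliar with the metric behind these numbers, the following is a small sketch of how an F1 score for multi-class scheme classification can be computed. The notes do not say which averaging the reported scores use; macro averaging is an assumption here, and the gold labels and predictions are made-up toy data.

```python
from typing import List


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumed): the classifier never predicts "expert".
gold = ["cause", "expert", "cause", "analogy"]
pred = ["cause", "cause", "cause", "analogy"]
score = macro_f1(gold, pred)  # per-class F1: analogy 1.0, cause 0.8, expert 0.0
```

Under macro averaging a scheme the model never predicts contributes a zero, so a model that defaults to a few familiar schemes is penalized even if its overall accuracy looks acceptable.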
The more interesting reframe is that the small-model failure may be a sharper version of a ceiling that constrains all of them. Across genuinely structural tasks (constraint satisfaction, iterative numerical methods), LLMs converge on a plateau regardless of parameter count, suggesting a fundamental limit rather than a scaling gap (see "Do larger language models solve constrained optimization better?" and "Do large language models actually perform iterative optimization?"). There's even a predictive lens here: if you treat an LLM as a machine that prefers high-probability continuations, you can forecast that low-probability targets, like an unfamiliar formal scheme label, will be systematically hard (see "Can we predict where language models will fail?"). Scheme classification looks like exactly such a target.
If you want a doorway out, the corpus points to it: small models can be lifted toward large-model performance on structured tasks not by scaling but by targeted training. DPO on a teacher model's correct-and-incorrect examples beats plain fine-tuning precisely because the explicit negative examples attack the rigid format failures small models are prone to (see "Can small models match large models on function calling?"). That suggests the small-model deficit in argument schemes is partly a teachable gap, and partly an instance of the deeper structural-reasoning ceiling no amount of size fully removes.
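To show why paired correct and incorrect examples matter, here is a minimal sketch of the standard DPO objective for a single preference pair. It is not the training code from the cited note: the function name `dpo_loss`, the default `beta=0.1`, and the log-probability values are assumptions chosen for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs.

    loss = -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Illustrative log-probabilities (assumed): reference model is indifferent.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy matches reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy favors the correct output
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)     # policy favors the incorrect output
```

The loss falls only when the policy raises the correct output relative to the incorrect one, so the rejected example contributes its own gradient signal. Plain supervised fine-tuning on correct outputs alone has no such term, which is consistent with the note's explanation of why negative examples help with rigid format failures.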
Sources (9 notes)
"Why does argument scheme classification stumble where other NLP tasks succeed?": Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.
"Do large language models reason symbolically or semantically?": When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
"Can large language models classify argument schemes reliably?": Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.
"Why do paraphrased definitions work better than expert ones?": LLM-generated descriptions of argument schemes yield better classification performance than expert Walton definitions. The advantage stems from paraphrases matching the model's training distribution better than formal logical vocabulary.
"Why do large language models fail at complex linguistic tasks?": Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
"Do larger language models solve constrained optimization better?": Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
"Do large language models actually perform iterative optimization?": Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
"Can we predict where language models will fail?": By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
"Can small models match large models on function calling?": Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.