Do scheme critical questions work better than direct scheme classification prompts?

This explores two different ways of putting argumentation schemes to work in an LLM: asking the model to *label* which scheme an argument uses (classification), versus feeding the scheme's built-in 'critical questions' to the model as a reasoning checklist — and whether the second beats the first.

The corpus suggests classification (name the scheme) and critical questions (use the scheme's diagnostic questions to interrogate an argument) aren't just two flavors of the same task: they ask the model to do fundamentally different kinds of cognitive work, and the critical-questions framing sidesteps exactly the thing models are worst at.

Start with why classification is hard. Scheme classification asks a model to recognize an inferential pattern stitched across scattered parts of a text, not a local surface feature. That integrative demand is precisely where models stall, plateauing around F1 0.55–0.65 even as the same systems clear 0.80 on nearby tasks like stance and component tagging (see: Why does argument scheme classification stumble where other NLP tasks succeed?). A classification prompt hands the model the whole recognition burden up front and asks for a single label.
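A direct classification prompt of the kind described above might look like the following minimal sketch. It is an illustration only, not a method taken from the cited notes: the label list, `build_classification_prompt`, and `parse_label` are hypothetical names, and no model is called.

```python
from typing import Optional, Sequence

# Hypothetical label set for illustration; real scheme inventories are larger.
SCHEMES = ("expert opinion", "analogy", "cause to effect", "practical reasoning")


def build_classification_prompt(argument: str, schemes: Sequence[str] = SCHEMES) -> str:
    """Ask for exactly one scheme label, leaving all recognition to the model."""
    options = "\n".join(f"- {s}" for s in schemes)
    return (
        "Which argumentation scheme does the argument below use?\n"
        f"Choose exactly one label from this list:\n{options}\n\n"
        f"Argument: {argument}\n"
        "Label:"
    )


def parse_label(reply: str, schemes: Sequence[str] = SCHEMES) -> Optional[str]:
    """Map a free-text reply to a known label, or None if nothing matches."""
    cleaned = reply.strip().lower().rstrip(".")
    return cleaned if cleaned in schemes else None
```

The single `Label:` slot is the point of the sketch: the prompt gives the model no intermediate steps, so any error in recognizing the pattern goes straight into the output.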

Critical questions invert that. Instead of demanding the answer to 'what kind of argument is this,' they decompose the scheme into explicit checks (does the warrant hold, what's the backing, what implicit premise is being skipped) and walk the model through them one at a time. Applying Toulmin's model this way (the CQoT method) improves reasoning and catches failures that plain chain-of-thought glides past, because the model is forced to verify warrants rather than recognize a category (see: Can structured argument prompts make LLM reasoning more rigorous?). The scheme stops being a thing to identify and becomes scaffolding that structures the model's own checking.
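The decomposition described above can be sketched as one prompt per check. This is a hedged illustration, not the CQoT implementation: the question wording, `CRITICAL_QUESTIONS`, and `build_critical_question_prompts` are assumptions introduced here, built from the three checks the paragraph names.

```python
from typing import List, Sequence

# The three checks named in the text, phrased as questions.
# The exact wording is an assumption made for illustration.
CRITICAL_QUESTIONS = (
    "What warrant links the evidence to the claim, and does it hold?",
    "What backing supports that warrant?",
    "What implicit premise does the argument rely on without stating it?",
)


def build_critical_question_prompts(
    argument: str, questions: Sequence[str] = CRITICAL_QUESTIONS
) -> List[str]:
    """Return one prompt per check so each is answered on its own turn."""
    total = len(questions)
    prompts = []
    for i, question in enumerate(questions, start=1):
        prompts.append(
            f"Argument: {argument}\n"
            f"Check {i} of {total}: {question}\n"
            "Answer this check only, citing the text where possible."
        )
    return prompts
```

Each prompt asks for verification of one element rather than a category name, which is the contrast with a single-label classification prompt.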

This fits a broader pattern in the corpus: structured prompting helps when it routes the right information through the model in the right order, and hurts when it doesn't. Step-by-step reasoning isn't universally good. It backfires on simple questions where direct question-to-answer flow is cleaner, and the optimal structure depends on the question type, not the task label (see: Why do some questions perform better without step-by-step reasoning?). Critical questions win partly because they impose structure that matches the actual shape of the reasoning the scheme encodes. There's a parallel finding that breaking 'question quality' into concrete attributes (clarity, relevance, specificity) and optimizing each beats training on a single quality score (see: Can models learn to ask genuinely useful clarifying questions?); critical questions are a hand-built version of that same decomposition.

Two caveats are worth carrying. First, structured prompting reorganizes what a model already knows. It can't supply scheme knowledge the model never learned, so neither approach rescues a model with no grasp of the underlying argument forms (see: Can prompt optimization teach models knowledge they lack?). Second, which prompt strategy helps varies sharply by model tier, with elaborate step-by-step structure sometimes *reducing* accuracy on the strongest models (see: Do prompt techniques work the same across all LLM tiers?). So the honest answer is: critical questions tend to outperform direct classification because they convert a recognition task the model is bad at into a guided verification task it's good at, but 'better' is conditional on the model already knowing the schemes and on the structure earning its keep rather than just adding ceremony.

Sources (6 notes)

Why does argument scheme classification stumble where other NLP tasks succeed?

Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Do prompt techniques work the same across all LLM tiers?

A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.
