Does argument-scheme prompting improve reasoning in non-code domains the same way?
This explores whether framing prompts as formal argument structures — Toulmin warrants, critical questions, attack/defense graphs — helps LLMs reason in fuzzy natural-language domains the way step-by-step prompting helps in code and math.
This explores whether dressing a prompt in the clothes of formal argumentation actually buys you better reasoning outside the tidy world of code and math. The corpus says: partly yes, but the gains and the ceilings work differently than they do in the code case. On the encouraging side, forcing a model to spell out the argument's hidden machinery does help. CQoT-style prompting asks the model to name its warrants and backing, the implicit 'why this follows' that standard chain-of-thought happily skips, and that extra step catches failures plain CoT lets through (see "Can structured argument prompts make LLM reasoning more rigorous?"). Formal frameworks push this further: structuring an answer as a traversable graph of attacks and defenses makes the reasoning contestable, so a reader can point at the exact premise they reject, something an unstructured paragraph never lets you do (see "Can formal argumentation make AI decisions truly contestable?").
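To make the attack/defense idea concrete, here is a minimal sketch of a Dung-style argumentation framework evaluated under grounded semantics, one common way to decide which arguments survive. The function name `grounded_extension` and the three toy arguments are hypothetical illustrations, not taken from any of the cited notes, and the sketch assumes contesting a premise can be modeled as simply removing that argument from the graph.

```python
from typing import Dict, Set, Tuple


def grounded_extension(arguments: Set[str], attacks: Set[Tuple[str, str]]) -> Set[str]:
    """Return the grounded extension of an abstract argumentation framework.

    `attacks` holds (attacker, target) pairs. Pairs that mention an argument
    outside `arguments` are ignored, so removing an argument also removes
    its attacks.
    """
    attackers: Dict[str, Set[str]] = {a: set() for a in arguments}
    for src, dst in attacks:
        if src in arguments and dst in arguments:
            attackers[dst].add(src)

    accepted: Set[str] = set()
    while True:
        # Everything attacked by a currently accepted argument is defeated.
        defeated = {dst for src, dst in attacks if src in accepted}
        # An argument is acceptable when all of its attackers are defeated.
        acceptable = {a for a in arguments if attackers[a] <= defeated}
        if acceptable == accepted:
            return accepted
        accepted = acceptable


# Toy example (hypothetical): an objection attacks the claim,
# and a rebuttal attacks the objection.
arguments = {"claim", "objection", "rebuttal"}
attacks = {("objection", "claim"), ("rebuttal", "objection")}

full = grounded_extension(arguments, attacks)
print(sorted(full))  # ['claim', 'rebuttal']

# A reader who rejects the rebuttal removes it; the claim no longer survives.
contested = grounded_extension(arguments - {"rebuttal"}, attacks)
print(sorted(contested))  # ['objection']
```

The second call is the contestability point in miniature: rejecting one named premise changes the outcome in a way the reader can trace, which a free-form paragraph does not allow.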
But here's the twist that makes non-code domains genuinely harder. Asking a model to *produce* a structured argument is not the same as asking it to *recognize* what kind of argument it's looking at, and the recognition task is where models stumble. Classifying argument schemes places a heavier demand on the model than other language tasks: the same systems that exceed F1 0.80 on tagging argument components or detecting stance plateau at 0.55–0.65 on scheme classification, because schemes live in inferential patterns smeared across distant spans of text, not in local surface cues (see "Why does argument scheme classification stumble where other NLP tasks succeed?"). Even the best models only reach that range with few-shot examples and explicit scheme descriptions; zero-shot fails uniformly, and smaller models hit a representational wall around 0.53 (see "Can large language models classify argument schemes reliably?"). So argument-scheme prompting in soft domains depends on a capability the model may not reliably have.
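The few-shot-plus-descriptions setup mentioned above can be sketched as a plain prompt builder. This is an assumption about what such a prompt might look like, not the format used in the cited work: the `Scheme` and `Example` types, the `build_scheme_prompt` function, the paraphrased scheme descriptions, and the sample passages are all hypothetical. No model is called; the sketch only assembles the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scheme:
    name: str
    description: str


@dataclass(frozen=True)
class Example:
    text: str
    scheme: str


def build_scheme_prompt(schemes: List[Scheme], examples: List[Example], target: str) -> str:
    """Assemble a few-shot classification prompt with explicit scheme descriptions."""
    known = {s.name for s in schemes}
    for ex in examples:
        if ex.scheme not in known:
            raise ValueError(f"example uses undescribed scheme: {ex.scheme}")

    lines = ["Classify the argument scheme used in the final passage.", "", "Schemes:"]
    lines += [f"- {s.name}: {s.description}" for s in schemes]
    lines.append("")
    for ex in examples:
        lines += [f"Passage: {ex.text}", f"Scheme: {ex.scheme}", ""]
    # The target passage is left unlabeled for the model to complete.
    lines += [f"Passage: {target}", "Scheme:"]
    return "\n".join(lines)


# Illustrative inputs (hypothetical wording).
schemes = [
    Scheme("argument from expert opinion",
           "A claim is supported because a credible expert in the field asserts it."),
    Scheme("argument from consequences",
           "An action is urged or opposed because of its likely outcomes."),
]
examples = [
    Example("Epidemiologists say the vaccine is safe, so it is safe.",
            "argument from expert opinion"),
    Example("Raising the toll would push traffic onto side streets, so we should not.",
            "argument from consequences"),
]
prompt = build_scheme_prompt(
    schemes, examples, "The bridge engineer signed off, so the span will hold."
)
print(prompt)
```

Keeping the descriptions in the prompt, rather than relying on labels alone, mirrors the finding that examples and explicit scheme descriptions are both needed before models reach the upper range.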
That matters because of a hard limit on what prompting can do at all. Prompt optimization only reorganizes knowledge already in the training distribution: it activates, it doesn't inject (see "Can prompt optimization teach models knowledge they lack?"). In code and math, the procedural skeleton of a valid solution is densely represented in pretraining; step-by-step scaffolding just surfaces it. In a domain where the relevant inferential moves are sparse or contested, an argument scaffold has less to grab onto. This connects to why CoT generalizes unevenly in the first place: reasoning that transfers rides on broad procedural knowledge drawn from many documents, not on retrieving specific facts (see "Does procedural knowledge drive reasoning more than factual retrieval?"). There's also evidence that CoT is often constrained imitation of familiar reasoning *forms* rather than genuine inference, which is exactly why it degrades under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?").
The quiet lesson, then, is that 'the same way' is the wrong expectation. Whether structured prompting helps is contingent on the question, not the technique: saliency analysis shows step-by-step prompting actually *hurts* on simple items where the question should flow straight to the answer, and the optimal prompt shape varies by question type rather than task category (see "Why do some questions perform better without step-by-step reasoning?"). So argument-scheme prompting is best read as a targeted instrument: it earns its keep when an argument has load-bearing implicit premises worth excavating and the model already holds the relevant inferential patterns, and it adds friction when either condition fails. The doorway worth walking through is the gap between generating structure and recognizing it: the technique that makes a model's reasoning more contestable to humans rests on a classification skill the same models are demonstrably weak at.
Sources (8 notes)
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
Dung-style argumentation structures AI outputs as traversable attack/defense graphs, allowing users to identify and contest specific premises. Standard LLM outputs lack this structure, making it impossible to pinpoint which claims users actually reject.
Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.
Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall, which depends on narrow, document-specific memorization of target facts.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.