INQUIRING LINE

Do computational systems need formal argument analysis for explainability?

This explores whether explainable AI actually requires formal, structured argument analysis (attack/defense graphs, argument schemes, critical-question checks) — or whether that's optional scaffolding — and what the corpus says about the payoff and the cost of building it in.


The corpus makes a surprisingly strong case that the structure isn't decoration: it is what makes a system's reasoning *contestable* rather than merely visible. A standard LLM answer is a wall of fluent text. You can read it, but you can't point at the one premise you reject and watch the conclusion fall. Framing outputs as Dung-style attack/defense graphs changes that, turning an opaque verdict into a graph you can traverse and challenge claim by claim [Can formal argumentation make AI decisions truly contestable?]. So 'explainable' splits into two things, being able to *see* the reasoning versus being able to *argue back*, and formal argumentation is mainly about the second.
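To make "point at one premise and watch the conclusion fall" concrete, here is a minimal sketch of a Dung-style abstract argumentation framework and its grounded extension (the most skeptical set of arguments that can be accepted). The loan scenario, the argument names, and the function name `grounded_extension` are hypothetical illustrations, not taken from the cited note.

```python
def grounded_extension(arguments: set[str], attacks: set[tuple[str, str]]) -> set[str]:
    """Compute the grounded extension by iterating the characteristic function from the empty set."""
    # Map each argument to the set of arguments that attack it.
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    accepted: set[str] = set()
    while True:
        # An argument is defended if every one of its attackers is itself
        # attacked by some argument that is already accepted.
        defended = {
            a
            for a in arguments
            if all(any((d, b) in attacks for d in accepted) for b in attackers[a])
        }
        if defended == accepted:
            return accepted
        accepted = defended


# Hypothetical example: a loan decision expressed as attack relations.
arguments = {"approve_loan", "income_unstable", "income_verified"}
attacks = {
    ("income_unstable", "approve_loan"),     # objection to the verdict
    ("income_verified", "income_unstable"),  # defense against the objection
}
print(sorted(grounded_extension(arguments, attacks)))
# ['approve_loan', 'income_verified']

# A user contests one specific premise by adding a single attack on it.
contested_arguments = arguments | {"letter_forged"}
contested_attacks = attacks | {("letter_forged", "income_verified")}
print(sorted(grounded_extension(contested_arguments, contested_attacks)))
# ['income_unstable', 'letter_forged']
```

In this toy case the verdict `approve_loan` is accepted only while its defense stands. Attacking that one premise removes the verdict from the accepted set, which is the claim-by-claim contestability the paragraph describes.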

The interesting twist is that the same formal structure also *improves* the reasoning, not just its legibility. When you turn Toulmin's argument model into explicit prompting steps, forcing the model to name its warrants and backing instead of skipping implicit premises, it catches failures that ordinary chain-of-thought lets slide [Can structured argument prompts make LLM reasoning more rigorous?]. That hints the answer to 'do we need it?' isn't purely about end-user trust; structured argument analysis is partly a reasoning prosthetic. And the cost can be low, because a lot of what looks like explanation is actually filler. Concise reasoning chains match verbose ones on arithmetic, symbolic, and commonsense tasks while using only 7.6% of the tokens, because the removed 92.4% served style and documentation rather than computation [Can minimal reasoning chains match full explanations?]. Likelihood-preserving pruning even shows a ranking of tokens by function: symbolic-computation steps are preserved and grammar and meta-discourse are discarded first [Which tokens in reasoning chains actually matter most?]. That suggests the *load-bearing* part of an explanation is small and structured, and the rest is performance.
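As a rough illustration of what "Toulmin's model as explicit prompting steps" could look like, the sketch below builds a prompt from the six standard Toulmin components and checks a response for skipped steps. The wording of each instruction and the helper names `build_toulmin_prompt` and `missing_steps` are assumptions made for this example; they are not the prompts used in the cited work, and no model is called.

```python
# The six components of Toulmin's argument model, each paired with a
# hypothetical instruction for the model to follow.
TOULMIN_STEPS = [
    ("claim", "State the conclusion you are arguing for in one sentence."),
    ("grounds", "List the facts or data the claim rests on."),
    ("warrant", "Name the rule that licenses moving from the grounds to the claim."),
    ("backing", "Say what supports the warrant itself."),
    ("qualifier", "State how strongly the claim holds (always, usually, possibly)."),
    ("rebuttal", "Give the conditions under which the claim would fail."),
]


def build_toulmin_prompt(question: str) -> str:
    """Turn a question into a prompt that asks for every Toulmin step in order."""
    lines = [f"Question: {question}", "", "Answer by filling in every step, in order:"]
    for i, (name, instruction) in enumerate(TOULMIN_STEPS, start=1):
        lines.append(f"{i}. {name.upper()}: {instruction}")
    lines.append("")
    lines.append("Do not give a final answer until all six steps are filled in.")
    return "\n".join(lines)


def missing_steps(response: str) -> list[str]:
    """Return the Toulmin steps whose label never appears in the response."""
    upper = response.upper()
    return [name for name, _ in TOULMIN_STEPS if f"{name.upper()}:" not in upper]


prompt = build_toulmin_prompt("Should the bridge be closed to heavy trucks?")
partial = "CLAIM: Close it.\nGROUNDS: Inspection found cracked girders."
print(missing_steps(partial))
# ['warrant', 'backing', 'qualifier', 'rebuttal']
```

The label check is deliberately crude: it only detects that a step was skipped, not that the warrant offered is a good one. It still shows how the scaffold makes an implicit premise visible as a missing field.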

Here's the catch the corpus keeps circling back to: machines are bad at the formal analysis we'd want them to do. Classifying argument schemes, that is, recognizing *what kind* of inference is being made, only works with larger models, few-shot examples, and explicit descriptions, and even then it plateaus around F1 0.55–0.65 while the same systems sail past 0.80 on simpler tagging tasks [Can large language models classify argument schemes reliably?]. That gap isn't random: scheme classification carries higher cognitive load because it requires integrating inferential patterns across scattered text spans rather than reading local surface features [Why does argument scheme classification stumble where other NLP tasks succeed?]. So we have a paradox: formal argument analysis is exactly the capability that would make systems explainable, and it's exactly the capability they're weakest at.

That weakness traces to something deeper. LLMs turn out to reason through semantic association, not symbolic logic: decouple the meaning from the rules and performance collapses even when the correct rules are sitting in context [Do large language models reason symbolically or semantically?]. Formal argumentation is symbolic by nature, so it cuts against the grain of how these models actually work, which is why imposing it externally (as prompt scaffolding or as a graph layer) does more than asking the model to be formal on its own. There's also a part of argument quality that no amount of structure recovers: the *force* of an argument depends partly on the authority of who's making it (reputation, track record, standing), and models process only text, so they can't distinguish an expert's claim from a confident commonplace [Can language models distinguish expert arguments from common assumptions?].

The payoff of going formal isn't only inward-facing. Interpretable linguistic features (cheap, transparent, no neural black box) detected LLM-generated counter-arguments on r/ChangeMyView at 99% accuracy by catching their tells: over-accommodation to the prompt and suspiciously textbook-clean argument markers [Can simple linguistic features detect AI-written arguments?]. So a useful reframing of the original question: you don't always need the system itself to perform formal argument analysis internally; sometimes a lightweight *external* layer of argument structure delivers the explainability and the auditability you were after. Whether you *need* it depends on which you want: a system you can read, or one you can genuinely argue with. Only the second requires the formal machinery.
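The idea of a lightweight external layer can be sketched with two transparent features that loosely mirror the tells mentioned above: how densely a reply uses argument markers, and how much of the prompt's vocabulary it echoes back. The marker list, the feature definitions, and the function names are hypothetical choices for illustration; they are not the feature set from the cited note, and the sketch stops at feature extraction rather than claiming any detection accuracy.

```python
import re

# Hypothetical list of "textbook" argument markers; a real feature set
# would be chosen and validated on labeled data.
ARGUMENT_MARKERS = (
    "however", "therefore", "furthermore", "moreover", "in conclusion", "for example",
)


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def marker_rate(text: str) -> float:
    """Argument markers per token: a crude proxy for textbook-clean structure."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    joined = " ".join(tokens)
    hits = sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", joined)) for marker in ARGUMENT_MARKERS
    )
    return hits / len(tokens)


def prompt_overlap(prompt: str, reply: str) -> float:
    """Share of the prompt's vocabulary reused in the reply: a crude proxy for accommodation."""
    prompt_words = set(tokenize(prompt))
    reply_words = set(tokenize(reply))
    if not prompt_words:
        return 0.0
    return len(prompt_words & reply_words) / len(prompt_words)


features = {
    "marker_rate": marker_rate("However, it rains. Therefore we stay."),
    "prompt_overlap": prompt_overlap("cats are great", "I agree cats are fine"),
}
```

Every number produced here can be traced back to specific words in the text, which is what keeps a layer like this auditable in a way a neural detector is not.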


Sources (9 notes)

Can formal argumentation make AI decisions truly contestable?

Dung-style argumentation structures AI outputs as traversable attack/defense graphs, allowing users to identify and contest specific premises. Standard LLM outputs lack this structure, making it impossible to pinpoint which claims users actually reject.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Can large language models classify argument schemes reliably?

Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.

Why does argument scheme classification stumble where other NLP tasks succeed?

Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can language models distinguish expert arguments from common assumptions?

LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.

Can simple linguistic features detect AI-written arguments?

General linguistic features combined with argument-quality measures achieved 99% accuracy detecting LLM-generated counter-arguments on r/ChangeMyView, matching heavyweight neural detectors while remaining computationally cheap and transparent. LLMs produce detectable stylistic signatures: accommodation to prompts and textbook-quality argument markers that humans don't replicate.

Next inquiring lines