How do logical forms of prompts influence what language models can derive?
This note asks whether the logical *form* you give a prompt (structured argument steps, formal rules, particular phrasings) genuinely expands what a model can derive, or whether models respond to surface statistics regardless of that form. The corpus tells a two-sided story: form matters, but mostly as a way to *organize* what's already there, not to unlock new reasoning.
On the optimistic side, logical scaffolding does help. Casting a prompt as explicit argumentation, with Toulmin-style critical questions that force the model to name its warrants and backing, catches reasoning failures that plain chain-of-thought slides past (see "Can structured argument prompts make LLM reasoning more rigorous?"). The form does work here: making implicit premises explicit changes what the model derives. But there's a hard ceiling. Prompt structure can only reorganize knowledge already in the model's training distribution; no logical form injects facts it never learned (see "Can prompt optimization teach models knowledge they lack?").
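To make the idea of "argument form as prompt steps" concrete, here is a minimal sketch of a prompt builder that walks through Toulmin's six components. The function name `build_argument_prompt` and the wording of each question are assumptions made for illustration; they are not taken from the CQoT method or from any of the source notes.

```python
# Toulmin's six components. The question wording is illustrative only
# (an assumption, not the CQoT paper's phrasing).
CRITICAL_QUESTIONS = {
    "claim": "What exactly is being claimed?",
    "data": "What evidence in the problem supports the claim?",
    "warrant": "What rule licenses the step from the evidence to the claim?",
    "backing": "Why should that rule be trusted here?",
    "qualifier": "How certain is the claim, and under what conditions?",
    "rebuttal": "What would make the claim false?",
}


def build_argument_prompt(question: str) -> str:
    """Wrap a question in explicit argument steps (hypothetical helper)."""
    lines = [
        f"Question: {question}",
        "",
        "Before answering, address each item in order:",
    ]
    for i, (part, prompt_text) in enumerate(CRITICAL_QUESTIONS.items(), start=1):
        lines.append(f"{i}. {part.capitalize()}: {prompt_text}")
    lines.append("")
    lines.append("Final answer:")
    return "\n".join(lines)


prompt = build_argument_prompt("Is 91 prime?")
print(prompt)
```

The scaffold only asks the model to surface premises it would otherwise skip. It supplies no new facts, which is the ceiling the paragraph above describes.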
The deeper catch is that models don't actually run on logical form at all. When you decouple the semantic content of a task from its logical structure, giving correct rules but unfamiliar meanings, performance collapses. LLMs are *semantic* reasoners leaning on token associations and commonsense, not *symbolic* ones manipulating the logical form you handed them (see "Do large language models reason symbolically or semantically?"). Chain-of-thought sharpens the point: it works by reproducing reasoning *shapes* seen in training, and it degrades predictably under distribution shift, which is the signature of imitating a form rather than executing it (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?").
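The decoupling described above can be illustrated with a small sketch. A purely symbolic procedure, here a toy forward-chaining loop, derives the same conclusions whether the symbols are familiar words or nonsense tokens, because it only ever looks at the logical form. The rules, facts, and nonsense vocabulary below are invented for illustration and are not drawn from the cited experiments.

```python
def forward_chain(facts, rules):
    """Return every fact derivable from `facts` using `rules`.

    Each rule is (premises, conclusion): if all premises are known,
    the conclusion becomes known too.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


def rename(facts, rules, mapping):
    """Replace every symbol via `mapping`, leaving the logical form intact."""
    new_facts = {mapping[f] for f in facts}
    new_rules = [
        (tuple(mapping[p] for p in premises), mapping[conclusion])
        for premises, conclusion in rules
    ]
    return new_facts, new_rules


# Familiar semantics (illustrative toy task).
facts = {"rains"}
rules = [(("rains",), "ground_wet"), (("ground_wet",), "shoes_muddy")]

# Same structure, unfamiliar meanings.
mapping = {"rains": "blick", "ground_wet": "zorp", "shoes_muddy": "quib"}
abstract_facts, abstract_rules = rename(facts, rules, mapping)

familiar = forward_chain(facts, rules)
abstract = forward_chain(abstract_facts, abstract_rules)

# A symbolic reasoner is invariant under renaming.
assert {mapping[f] for f in familiar} == abstract
```

The renamed task is what a decoupling test would hand to a model: if accuracy drops on the `blick`/`zorp` version while the symbolic solver is unaffected, the model was leaning on meaning rather than on the rules.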
That's why "logical form" is shakier than it looks. Two prompts with identical logical and semantic meaning can produce systematically different outputs purely because one phrasing appeared more often in pre-training: the model registers statistical mass, not equivalence (see "Why do semantically identical prompts produce different LLM outputs?"). And what reads as the model honoring your constraints is often a conservative default: strip the constraints away and most models do *worse*, revealing they were leaning on a safe heuristic, not reasoning about the logical structure you specified (see "Are models actually reasoning about constraints or just defaulting conservatively?").
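One rough way to probe the phrasing sensitivity described above is to ask the same question in several equivalent wordings and measure how often the answers agree. The sketch below assumes a caller-supplied `ask` function that maps a prompt to an answer string; `paraphrase_agreement` and `toy_model` are hypothetical names, and the toy model is a stand-in rather than a real LLM call.

```python
from collections import Counter
from typing import Callable, Sequence


def paraphrase_agreement(ask: Callable[[str], str], paraphrases: Sequence[str]) -> float:
    """Fraction of paraphrases whose answer matches the most common answer."""
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    # Normalize lightly so trivial formatting differences do not count.
    answers = [ask(p).strip().lower() for p in paraphrases]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def toy_model(prompt: str) -> str:
    """Stand-in model that only handles one surface phrasing (illustrative)."""
    return "4" if prompt.startswith("What is") else "unsure"


paraphrases = [
    "What is 2 + 2?",
    "Compute the sum of 2 and 2.",
    "What is two plus two?",
]
score = paraphrase_agreement(toy_model, paraphrases)
print(f"agreement: {score:.2f}")
```

A score of 1.0 means the answers were invariant to wording. Agreement measures consistency only, so a model can score 1.0 while being consistently wrong; checking against a reference answer would be a separate step.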
The thing you didn't know you wanted to know: logical form influences models less by being *logical* and more by being *familiar*. A well-formed argument prompt helps not because the model parses its validity, but because that argumentative shape is a high-frequency pattern it can imitate well. The form is a steering wheel for activating training-distribution behavior — not a compiler that executes your logic.
Sources (6 notes)
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.