Can reasoning chains work without logical validity?
This explores whether the step-by-step reasoning chains that boost LLM performance actually depend on the logic being correct — or whether something else entirely is doing the work.
The corpus has a surprisingly consistent, almost subversive answer: no, chain-of-thought (CoT) reasoning does not need to be logically valid to work, and that tells us what reasoning chains really are. The most direct evidence is that logically *invalid* CoT exemplars perform nearly as well as valid ones on hard benchmarks, which means the structural form of the reasoning, not the soundness of the inference, is driving the gains (Does logical validity actually drive chain-of-thought gains?). Push further and the result gets stronger: deliberately *corrupted* reasoning traces teach models as well as correct ones and sometimes generalize better out-of-distribution, which suggests the trace functions as computational scaffolding rather than as a meaningful argument (Do reasoning traces need to be semantically correct?).
The unifying explanation across the collection is that CoT is *constrained imitation of reasoning form*, not genuine abstract inference (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; What makes chain-of-thought reasoning actually work?). Models reproduce familiar reasoning patterns from training rather than performing novel symbolic steps, which is exactly why format beats content. One note quantifies it bluntly: training format shapes reasoning strategy 7.5× more than the actual domain, and where you place a demonstration can swing accuracy by 20% (What makes chain-of-thought reasoning actually work?). If validity were the engine, none of these surface-level levers would matter so much.
There's a deeper wrinkle the corpus raises that you might not expect: even when the chain *looks* valid, the model may not be using it. Reasoning chains frequently fail both causal sufficiency (the steps don't actually determine the answer) and causal necessity (spurious steps are common), meaning most CoT evaluation measures output quality rather than whether the reasoning caused the result (Do language models actually use their reasoning steps?). So logical validity is doubly beside the point: invalid chains still work, and valid-looking chains often aren't load-bearing anyway. A complementary finding shows you can dynamically prune 75% of reasoning steps with no accuracy loss, because verification and backtracking steps barely get attended to downstream (Can reasoning steps be dynamically pruned without losing accuracy?). A related result matches full CoT accuracy at 7.6% of the token cost; the other 92.4% was style and documentation, not computation (Can minimal reasoning chains match full explanations?).
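One way to make the causal-necessity idea concrete is a leave-one-out ablation: drop each step in turn and check whether the final answer changes. The sketch below is illustrative only and is not the method used in the cited note. `necessary_steps` and `toy_answer_fn` are hypothetical names, and `toy_answer_fn` is a deterministic stand-in for a real model call.

```python
import re
from typing import Callable, List, Sequence

# Hypothetical interface: takes the reasoning steps, returns a final answer.
AnswerFn = Callable[[Sequence[str]], str]


def necessary_steps(steps: Sequence[str], answer_fn: AnswerFn) -> List[int]:
    """Return indices of steps whose removal changes the final answer."""
    baseline = answer_fn(steps)
    changed = []
    for i in range(len(steps)):
        # Rebuild the chain with step i left out.
        ablated = [s for j, s in enumerate(steps) if j != i]
        if answer_fn(ablated) != baseline:
            changed.append(i)
    return changed


def toy_answer_fn(steps: Sequence[str]) -> str:
    """Toy stand-in for a model: report the last 'x = <number>' in the chain."""
    answer = "unknown"
    for step in steps:
        found = re.findall(r"x = (\d+)", step)
        if found:
            answer = found[-1]
    return answer


chain = [
    "Let x be the number of apples.",
    "x + 3 = 10, so x = 7.",
    "Double-check: 7 + 3 = 10.",
    "Wait, did I misread the question? No, it is fine.",
]

# Indices of steps the toy "model" actually depends on.
load_bearing = necessary_steps(chain, toy_answer_fn)
```

In this toy chain only the second step is load-bearing; the verification and backtracking steps can be removed without changing the answer. With a real model, the same loop would likely need greedy decoding or repeated sampling to separate genuine dependence from sampling noise.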
So what *is* doing the work, if not logic? The corpus points at semantic pattern-matching anchored to the training distribution. LLMs behave as in-context *semantic* reasoners, not symbolic ones: decouple the semantic content from the task and performance collapses even when the correct rules are sitting right there in the prompt (Do large language models reason symbolically or semantically?). Reasoning failures track instance-level *unfamiliarity* rather than logical complexity: a chain of any length succeeds if the model saw similar instances in training, regardless of how hard the problem 'should' be (Do language models fail at reasoning due to complexity or novelty?).
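To illustrate what "decoupling the semantic content" can look like in practice, the sketch below rewrites a rule so that its content words become opaque symbols while the logical structure stays intact. This is an assumption about the general shape of such a test, not the cited note's actual procedure; `symbolize` and the `A1`, `A2` naming scheme are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def symbolize(text: str, vocabulary: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each whole-word occurrence of a vocabulary term with an opaque symbol."""
    if not vocabulary:
        return text, {}
    # Hypothetical naming scheme: first term -> A1, second -> A2, and so on.
    mapping = {word: f"A{i}" for i, word in enumerate(vocabulary, start=1)}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in vocabulary) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


rule = "If X is the parent of Y and Y is the parent of Z, then X is the grandparent of Z."

# Same logical rule, but the relation names no longer carry familiar meaning.
abstract_rule, mapping = symbolize(rule, ["parent", "grandparent"])
```

Both versions state the same inference rule. A purely symbolic reasoner should handle them equally well, so a performance gap between the two prompts is the kind of evidence the semantic-reasoner claim rests on.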
The quietly unsettling payoff: reasoning chains work *without* logical validity precisely because they were never doing logic. And the imitation has a cost: reasoning models actually underperform plain models on exception-based rule inference, where chain-of-thought injects math overuse, overgeneralization, and hallucinated constraints (Why do reasoning models fail at exception-based rule inference?). The predictable failure modes that the critiques catalog (Why does chain-of-thought reasoning fail in predictable ways?) are the flip side of the same coin: a system that mimics the shape of reasoning will shine when the shape fits and break when genuine inference is required.
Sources (12 notes)
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
LLM reasoning chains fail both causal sufficiency (steps don't always matter) and causal necessity (spurious steps are common). Research shows most CoT evaluation measures output quality, not whether reasoning actually caused the answer.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.