What types of math proofs benefit most from proof-by-contradiction framing?
This asks about proof-by-contradiction as a math technique — but the corpus doesn't cover proof strategies directly; what it does have is a sharp body of work on whether LLM-generated 'proofs' of any framing are valid reasoning or just convincing form.
This reads as a question about which proof types reward a contradiction framing — and here's the honest pivot: the collection has no paper on proof-by-contradiction (or induction, or direct proof) as a mathematical technique. What it has instead is a more unsettling adjacent finding, which is that for today's models the *framing* of a proof may matter far less than whether any genuine inference is happening underneath it at all. If you came looking for 'when does contradiction work best,' the corpus answers a sneakier question: 'does the proof structure you see mean anything?'
The strongest thread is that LLMs reproduce the *shape* of reasoning without its substance. Invalid chain-of-thought exemplars perform nearly as well as logically valid ones (Does logical validity actually drive chain-of-thought gains?), and RLVR post-training makes adjacent steps more coherent while leaving the global proof potentially invalid (Does RLVR actually improve mathematical reasoning or just coherence?). The form/content gap is the headline: training format shapes reasoning strategy about 7.5× more than domain does, and invalid prompts work as well as valid ones (What makes chain-of-thought reasoning actually work?). So a model can produce a beautifully staged proof-by-contradiction (assume the negation, derive an absurdity) where each move is locally plausible but the contradiction never actually bites. The framing is decorative, not load-bearing.
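For contrast, here is a short worked example of a contradiction that does bite. It is a standard textbook argument, not something drawn from the corpus, and is included only to show what "load-bearing" means: the final step clashes with the very assumption made in the first step.

```latex
\textbf{Claim.} There is no smallest positive rational number.

\textbf{Proof.} Suppose, for contradiction, that $q$ is the smallest positive rational.
Since $q$ is rational and $q > 0$, the number $q/2$ is also rational, and
\[
  0 < \frac{q}{2} < q .
\]
So $q/2$ is a positive rational strictly smaller than $q$,
which contradicts the assumed minimality of $q$.
Hence no smallest positive rational exists. \hfill $\square$
```

Every line here is used: drop the assumption that $q$ is minimal and the last step has nothing to contradict. A decorative version would announce "contradiction" without any step actually colliding with the negated claim.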
That fragility is concrete. Math reasoning collapses when you merely change the numbers or insert an irrelevant clause, which marks it as pattern-matching rather than symbolic deduction (Does LLM math reasoning truly generalize or just pattern match?). And the very models that *look* most rigorous, long-chain reasoners like o1 and R1, are the most exposed: o1-preview and DeepSeek-R1 reach only 20–23.6% exact match on constraint-satisfaction problems that demand real backtracking (Can reasoning models actually sustain long-chain reflection?), and their extended chains create more points where one corrupted step propagates (Why do reasoning models fail under manipulative prompts?). Proof-by-contradiction is a genre especially vulnerable to this, since it depends on a long chain remaining valid all the way to the absurdity: one fabricated intermediate step and the whole 'contradiction' is hollow.
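The perturbation idea above (change the numbers, insert an irrelevant clause, see whether accuracy holds) can be sketched as a small harness. This is a minimal illustration, not the GSM-Symbolic implementation: the template, the distractor sentence, the names, and `first_two_sum` (a stand-in for a real model call) are all hypothetical.

```python
import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    text: str    # the word problem shown to the solver
    answer: int  # ground truth computed from the sampled numbers


# Hypothetical template and distractor; the distractor is irrelevant to the answer.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "{distractor}How many apples does {name} have?")
DISTRACTOR = "Five of the apples are slightly smaller than the rest. "


def make_variants(n, seed=0, with_distractor=False):
    """Sample n surface variants of one problem with known answers."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        a, b = rng.randint(2, 99), rng.randint(2, 99)
        name = rng.choice(["Ava", "Ben", "Chen"])
        text = TEMPLATE.format(
            name=name, a=a, b=b,
            distractor=DISTRACTOR if with_distractor else "",
        )
        variants.append(Variant(text, a + b))
    return variants


def accuracy(solver, variants):
    """Fraction of variants on which solver(text) equals the ground truth."""
    return sum(solver(v.text) == v.answer for v in variants) / len(variants)


def first_two_sum(text):
    """Placeholder solver: adds the first two integers written as digits."""
    nums = [int(x) for x in re.findall(r"\d+", text)]
    return nums[0] + nums[1]


clean = make_variants(20, seed=1)
noisy = make_variants(20, seed=1, with_distractor=True)
gap = accuracy(first_two_sum, clean) - accuracy(first_two_sum, noisy)
```

Because both sets are drawn with the same seed, they share numbers and answers and differ only in the inserted clause, so any accuracy gap is attributable to the distractor. A solver that truly deduces should show a gap near zero; swapping `first_two_sum` for a model call is where the real measurement would happen.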
Where the corpus *does* gesture at an answer to your underlying instinct, that some proof framings are sturdier than others, is in the formalization work. Partial symbolic abstraction beats both pure natural language and full formalization, because selective structure adds rigor without discarding meaning (Why does partial formalization outperform full symbolic logic?). Semi-formal templates that force explicit premises act as 'completeness certificates,' catching gaps free-form reasoning glides past (Can structured templates make code reasoning more reliable than free-form thinking?), and Toulmin-style critical-question prompts force a model to surface the warrant it would otherwise skip (Can structured argument prompts make LLM reasoning more rigorous?). Translated to your question: the proof framing that 'benefits most' isn't determined by the math topic; it's whichever framing forces every hidden premise into the open. Proof-by-contradiction earns its keep precisely when the structured negation makes an otherwise-skipped assumption explicit and checkable.
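One way to picture a template that "forces every hidden premise into the open" is a structure in which each step must cite the premises or earlier steps it relies on. The sketch below is an assumption-laden illustration, not the template from any of the cited notes: `Step`, `find_gaps`, and the label scheme are hypothetical, and the check is pure bookkeeping. It confirms that every step cites something already declared, not that the inference is logically valid.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    label: str                                     # e.g. "S1"
    claim: str                                     # what the step asserts
    uses: list[str] = field(default_factory=list)  # labels it relies on


def find_gaps(premises: dict[str, str], steps: list[Step]) -> list[str]:
    """List steps that cite nothing, or cite a label not yet declared.

    Bookkeeping only: this does not verify that a claim follows from
    the labels it cites.
    """
    known = set(premises)
    gaps = []
    for step in steps:
        if not step.uses:
            gaps.append(f"{step.label}: no justification cited")
        for ref in step.uses:
            if ref not in known:
                gaps.append(f"{step.label}: cites undeclared '{ref}'")
        known.add(step.label)  # later steps may cite this one
    return gaps


# Hypothetical contradiction proof: there is no smallest positive rational.
PREMISES = {
    "P1": "Assume for contradiction that q is the smallest positive rational",
    "P2": "Half of a positive rational is a positive rational",
}
STEPS = [
    Step("S1", "q/2 is a positive rational", ["P1", "P2"]),
    Step("S2", "q/2 < q", ["P1"]),
    Step("S3", "Contradiction with the minimality of q", ["S1", "S2", "P1"]),
]

gaps = find_gaps(PREMISES, STEPS)  # empty when every citation resolves
```

The useful property for a contradiction frame is that the negated claim has to appear as a labeled premise (`P1` here) and the final step has to cite it. A trace whose closing "contradiction" never references the assumption, or leans on an undeclared fact, shows up as a gap instead of gliding past.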
So the thing you didn't know you wanted to know: the right question for this corpus isn't 'which proofs suit contradiction' but 'which framing makes a model show its work.' A contradiction frame helps most where it converts an implicit leap into an explicit, falsifiable claim — and helps least where it just gives the model more rope to generate fluent, locally-coherent, globally-empty steps.
Sources (9 notes)
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
GSM-Symbolic found that LLMs show high variance across question reformulations, decline sharply when numbers change, and fail when irrelevant but related clauses are inserted. These failures indicate probabilistic pattern-matching rather than true symbolic reasoning.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.
Semi-formal templates requiring explicit premises, code-path traces, and evidence checks improved patch equivalence accuracy from 78% to 88%, catching cases like function shadowing that free-form reasoning missed. Templates act as completeness certificates without formal verification.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.