INQUIRING LINE

What distinguishes inductive inference from negative evidence versus positive patterns?

This explores a hidden asymmetry in how AI systems learn rules: whether learning from what's *wrong* (negative evidence, exceptions, ruled-out cases) works differently than learning from what's *right* (positive examples, confirming patterns) — both at inference time and during training.


The corpus suggests that learning from what is *ruled out* and learning from what *confirms a pattern* are not mirror images, and the difference exposes where today's reasoning models are weak. The most direct evidence is that reasoning models actively struggle with negative evidence. On exception-based rule tasks, where the rule is defined partly by what doesn't fit, reasoning models scored below 25% while plain non-reasoning models hit 55–65% (see "Why do reasoning models fail at exception-based rule inference?"). Chain-of-thought made it worse, not better: it overgeneralized, hallucinated constraints, and forced patterns onto cases that were really telling the model 'no.' Positive patterns invite confident generalization; negative evidence demands restraint, and the reasoning machinery doesn't have that restraint.
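To make the role of an exception concrete, here is a minimal sketch of rule induction by hypothesis elimination. It is not the benchmark from the cited note: the candidate rules, the example numbers, and the function name `consistent` are all invented for illustration. It shows only that confirming examples alone leave overgeneral rules standing, while a single ruled-out case removes them.

```python
from typing import Callable, Dict, Iterable, List

Hypothesis = Callable[[int], bool]

# Hypothetical candidate rules, invented for this illustration.
CANDIDATES: Dict[str, Hypothesis] = {
    "any number": lambda n: True,
    "even": lambda n: n % 2 == 0,
    "even, except multiples of 6": lambda n: n % 2 == 0 and n % 6 != 0,
    "odd": lambda n: n % 2 == 1,
}


def consistent(positives: Iterable[int], negatives: Iterable[int]) -> List[str]:
    """Return the names of rules that accept every positive and reject every negative."""
    positives = list(positives)
    negatives = list(negatives)
    return [
        name
        for name, rule in CANDIDATES.items()
        if all(rule(n) for n in positives) and not any(rule(n) for n in negatives)
    ]


# Confirming examples only: several rules, including overgeneral ones, survive.
positives_only = consistent([2, 4, 8], [])

# Add one exception (12 is ruled out): the overgeneral rules are eliminated.
with_exception = consistent([2, 4, 8], [12])

print(positives_only)
print(with_exception)
```

In this toy setting the positive examples cannot distinguish "even" from the narrower rule; only the negative example does, which is the restraint the paragraph above describes.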

The surprise is that during *training*, the asymmetry flips in negative evidence's favor. Training a model only on its wrong answers (suppressing incorrect trajectories without ever rewarding correct ones) consistently improves Pass@k and often matches or beats full reinforcement learning with PPO and GRPO (see "Does negative reinforcement alone outperform full reinforcement learning?"). Positive-only reinforcement actually degrades Pass@k at higher k, because rewarding 'good' answers concentrates probability mass and collapses diversity. Negative reinforcement carves away the bad while leaving the space of possibilities open. So pruning what's wrong preserves more than reinforcing what's right, the opposite of the inference-time finding.
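The mechanism can be sketched on a toy softmax policy over four candidate answers. This is an illustrative assumption, not the training setup of the cited work: the starting probabilities, the learning rate, and the helper names `softmax`, `entropy`, and `step` are hypothetical. Each call to `step` applies one REINFORCE-style update for a single sampled answer, with reward +1 for a sampled correct answer (positive-only) or reward -1 for a sampled wrong answer (negative-only).

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats, used here as a rough diversity measure."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def step(logits: List[float], sampled: int, reward: float, lr: float = 1.0) -> List[float]:
    """One policy-gradient step: logits += lr * reward * d/dlogits log p(sampled)."""
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if j == sampled else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical answers: index 0 and 1 are correct, index 2 and 3 are wrong.
start_probs = [0.4, 0.1, 0.3, 0.2]
logits = [math.log(p) for p in start_probs]

# Positive-only: reward the sampled correct answer at index 0.
psr_probs = softmax(step(logits, sampled=0, reward=+1.0))

# Negative-only: penalize the sampled wrong answer at index 2.
nsr_probs = softmax(step(logits, sampled=2, reward=-1.0))

print("positive-only:", [round(p, 3) for p in psr_probs], "entropy", round(entropy(psr_probs), 3))
print("negative-only:", [round(p, 3) for p in nsr_probs], "entropy", round(entropy(nsr_probs), 3))
```

In this sketch the positive step takes probability away from the other, unsampled correct answer, while the negative step lowers only the penalized answer and spreads the freed mass over everything else, so the distribution stays more spread out. A single step on a toy distribution says nothing about Pass@k at scale; it only illustrates the direction of the two updates.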

Putting these two together is the real payoff: negative evidence is *powerful as a training signal but poorly handled as an inference task*. The corpus hints at why. Several notes argue that what looks like reasoning in these models is really imitation of a reasoning *form*: chain-of-thought reproduces familiar patterns from training rather than performing genuine inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"), invalid reasoning steps work nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"), and even systematically irrelevant traces teach about as well as correct ones (see "Do reasoning traces need to be semantically correct?"). Positive-pattern matching is exactly what imitation is good at. Recognizing an exception, that is, noticing that a confident pattern must be *blocked*, is the thing imitation can't fake.

There's a deeper root here too. When models reason, they lean on semantic associations rather than formal logic; strip the familiar semantics and performance collapses even when the correct rule is sitting in context (see "Do large language models reason symbolically or semantically?"). Negative evidence is inherently a logical operation (it says 'this generalization does not hold here'), and that's precisely the symbolic move these models substitute association for. Positive patterns ride on association; negative patterns require the logic the models don't actually have.
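For contrast, the following minimal sketch shows what a purely symbolic rule application looks like: a tiny forward-chaining loop whose result does not depend on what the symbols mean. The rules, facts, and the renaming to opaque tokens (`p1` to `p4`) are hypothetical and chosen only to illustrate "decoupling semantics"; they are not taken from the cited note.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (premises, conclusion)


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Apply rules until no new conclusion can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


# Hypothetical rules and facts with familiar, meaningful names.
rules: List[Rule] = [
    (frozenset({"bird"}), "has_wings"),
    (frozenset({"has_wings", "healthy"}), "can_fly"),
]
facts: Set[str] = {"bird", "healthy"}

# The same problem with the semantics stripped: every symbol becomes an opaque token.
mapping: Dict[str, str] = {"bird": "p1", "healthy": "p2", "has_wings": "p3", "can_fly": "p4"}
opaque_rules: List[Rule] = [
    (frozenset(mapping[s] for s in premises), mapping[conclusion])
    for premises, conclusion in rules
]
opaque_facts: Set[str] = {mapping[s] for s in facts}

familiar = forward_chain(facts, rules)
opaque = forward_chain(opaque_facts, opaque_rules)

print(sorted(familiar))
print(sorted(opaque))
```

A symbolic procedure derives the same conclusions under renaming, because it uses only the structure of the rules. The finding described above is that model performance does not show this invariance.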

If you want to follow the thread further, the work on whether traces are causally necessary at all (see "Do reasoning traces actually cause correct answers?") sharpens the point: if the visible reasoning is stylistic rather than functional, then the gap on negative evidence isn't a bug to patch in the prompt; it's a property of systems that pattern-match forward but can't reliably reason about what to exclude.


Sources (7 notes)

Why do reasoning models fail at exception-based rule inference?

Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher re-testing claims about inductive inference under negative versus positive evidence in LLMs. The question remains open: do reasoning models genuinely struggle with negative evidence more than positive patterns, and does this gap reflect architectural or training-regime limits?

What a curated library found — and when (dated claims, not current truth):

Findings span 2023–2025, with sharp acceleration in 2025:

• Reasoning models score below 25% on exception-based rule tasks (rules defined by what *doesn't* fit), while non-reasoning models hit 55–65%; chain-of-thought worsens performance (arXiv:2505.24225, ~2025).
• Training on negative evidence alone (suppressing wrong answers without rewarding correct ones) matches or beats full RL; positive-only reinforcement degrades diversity and collapses probability mass (arXiv:2506.01347, ~2025).
• Chain-of-thought reproduces familiar patterns from training rather than performing genuine inference; logically invalid CoT steps perform nearly as well as valid ones (arXiv:2506.02878, arXiv:2307.10573, ~2023–2025).
• LLMs are semantic reasoners, not symbolic reasoners; strip familiar semantics and performance collapses even when the correct rule is in context (arXiv:2305.14825, ~2023).

Anchor papers (verify; mind their dates):
• arXiv:2505.24225 — Reasoning Can Hurt the Inductive Abilities of Large Language Models (2025)
• arXiv:2506.01347 — The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (2025)
• arXiv:2305.14825 — Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners (2023)
• arXiv:2506.02878 — CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate (2025)

Your task:

(1) RE-TEST EACH CONSTRAINT. For the 25% vs. 55–65% exception-task gap: have newer model scaling, improved CoT variants (e.g., self-correction, tree-search, multi-agent debate), or post-hoc symbolic verification bridged or further widened it? Separately, verify whether negative reinforcement's advantage persists in larger models and whether it transfers across domains. Isolate what is durable (negative evidence likely remains hard for forward pattern-matchers) from what may be resolved (e.g., by ensemble methods, neurosymbolic hybrids, or task-specific scaffolding).

(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months. Look for papers claiming reasoning models *do* handle negative evidence well, or showing negative-only reinforcement fails in a key domain, or demonstrating CoT as genuinely symbolic. Flag disagreement directly.

(3) Propose 2 research questions that *assume the regime has moved*: (a) If negative evidence is truly a symbolic operation models cannot perform, can auxiliary symbolic modules or neuro-symbolic integration genuinely decouple negative reasoning from semantic association? (b) Does the asymmetry persist in multimodal or embodied settings, or does grounding in negative feedback (e.g., physical collision, outcome mismatch) override the semantic-vs-symbolic bottleneck?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
