What distinguishes inductive inference from negative evidence versus positive patterns?
This explores a hidden asymmetry in how AI systems learn rules: whether learning from what's *wrong* (negative evidence, exceptions, ruled-out cases) works differently than learning from what's *right* (positive examples, confirming patterns) — both at inference time and during training.
The corpus suggests these two kinds of learning are not mirror images, and the difference exposes where today's reasoning models are weak. The most direct evidence is that reasoning models actively struggle with negative evidence. Across four game-based tasks with exception-based rules, where the rule is defined partly by what doesn't fit, reasoning models scored below 25% while plain non-reasoning models reached 55–65% (see "Why do reasoning models fail at exception-based rule inference?"). Chain-of-thought made it worse, not better: it overused math, overgeneralized, hallucinated constraints, and forced patterns onto cases that were really telling the model 'no.' Positive patterns invite confident generalization; negative evidence demands restraint, and the reasoning machinery appears to lack that restraint.
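To make the contrast concrete, here is a minimal sketch of exception-based rule induction as hypothesis elimination. The candidate rules and the examples are hypothetical, invented for illustration; they are not taken from the tasks in the cited note. The sketch shows positive examples leaving every candidate standing, while a single negative example removes all but the rule that carries an exception.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[int], bool]
Example = Tuple[int, bool]  # (input, does the rule accept it?)

# Hypothetical candidate rules, ordered from most to least general.
CANDIDATES: Dict[str, Rule] = {
    "any number": lambda n: True,
    "even": lambda n: n % 2 == 0,
    "even, except multiples of 10": lambda n: n % 2 == 0 and n % 10 != 0,
    "multiple of 4": lambda n: n % 4 == 0,
}


def consistent(rule: Rule, examples: List[Example]) -> bool:
    """A rule survives only if it agrees with every labeled example."""
    return all(rule(x) == label for x, label in examples)


def surviving(examples: List[Example]) -> List[str]:
    """Names of the candidate rules not yet ruled out by the examples."""
    return [name for name, rule in CANDIDATES.items() if consistent(rule, examples)]


# Confirming cases only: every candidate accepts 4, 8, and 12.
positives: List[Example] = [(4, True), (8, True), (12, True)]
print(surviving(positives))

# One ruled-out case: 20 is rejected, which blocks the broader generalizations.
print(surviving(positives + [(20, False)]))
```

In this toy setup the three confirming cases eliminate nothing, and the one rejected case does all the work. A learner that keeps generalizing from the confirmations and treats the rejection as noise ends up with the wrong rule.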
The surprise is that during *training*, the asymmetry flips in negative evidence's favor. Training a model only on its wrong answers, suppressing incorrect trajectories without ever rewarding correct ones, consistently improves Pass@k (the chance that at least one of k sampled answers is correct) across the range of k and often matches full reinforcement learning with PPO or GRPO (see "Does negative reinforcement alone outperform full reinforcement learning?"). Positive-only reinforcement actually degrades performance at higher k, because rewarding 'good' answers concentrates probability mass and collapses diversity. Negative reinforcement carves away the bad while leaving the space of possibilities open. So pruning what's wrong preserves more than reinforcing what's right, the opposite of the inference-time finding.
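The link between concentrated probability mass and weaker results at higher k can be illustrated with a toy calculation. The sketch below assumes each of the k samples is drawn independently, so a problem with per-sample success probability p is solved with probability 1 - (1 - p)^k. The two policies and their probabilities are hypothetical numbers chosen for illustration, not results from the cited note.

```python
from typing import Sequence


def pass_at_k(success_probs: Sequence[float], k: int) -> float:
    """Mean over problems of P(at least one of k independent samples is correct)."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem success probabilities for two problems.
diverse = [0.3, 0.3]       # mass stays spread out; both problems remain reachable
concentrated = [0.9, 0.0]  # mass sharpened on problem 1; problem 2 is never solved

for k in (1, 8, 64):
    print(k, round(pass_at_k(diverse, k), 3), round(pass_at_k(concentrated, k), 3))
```

Under these assumed numbers the concentrated policy is ahead at k = 1 (0.45 against 0.30) but plateaus near 0.5 however many samples are drawn, while the diverse policy approaches 1. That is the shape of the trade-off described above: sharpening the distribution can help the single best guess and still cost coverage.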
Putting these two together is the real payoff: negative evidence is *powerful as a training signal but poorly handled as an inference task*. The corpus hints at why. Several notes argue that what looks like reasoning in these models is really imitation of a reasoning *form*: chain-of-thought reproduces familiar patterns from training rather than performing genuine inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"), invalid reasoning steps work nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"), and even systematically irrelevant traces teach about as well as correct ones (see "Do reasoning traces need to be semantically correct?"). Positive-pattern matching is exactly what imitation is good at. Recognizing an exception, that is, noticing that a confident pattern must be *blocked*, is the thing imitation can't fake.
There's a deeper root here too. When models reason, they lean on semantic associations rather than formal logic; strip the familiar semantics and performance collapses even when the correct rule is sitting in context (see "Do large language models reason symbolically or semantically?"). Negative evidence is inherently a logical operation, since it says 'this generalization does not hold here', and that's precisely the symbolic move these models substitute association for. Positive patterns ride on association; negative patterns require the logic the models don't actually have.
If you want to follow the thread further, the work on whether traces are causally necessary at all (see "Do reasoning traces actually cause correct answers?") sharpens the point: if the visible reasoning is stylistic rather than functional, then the gap on negative evidence isn't a bug to patch in the prompt. It's a property of systems that pattern-match forward but can't reliably reason about what to exclude.
Sources (7 notes)
Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, indicating that traces are not causally necessary: they correlate with answers via learned formatting, not functional reasoning.