INQUIRING LINE

Do negative constraints require fundamentally different training signals than positive instructions?

This line asks whether telling a model what NOT to do (negative constraints, suppression) demands a different kind of training signal than telling it what TO do (positive instructions). The corpus suggests the answer is yes, in a surprisingly literal way.


Put plainly: are 'don't do X' constraints learned through the same mechanism as 'do Y' instructions, or do they need fundamentally different signals? The corpus has a striking piece of direct evidence that they do. One study found that reinforcement learning with *only* negative samples (penalizing wrong trajectories and never explicitly rewarding right ones) improves Pass@k across the range of k and often matches full RL with PPO or GRPO, while positive-only reinforcement *degrades* performance at higher k (see "Does negative reinforcement alone outperform full reinforcement learning?"). The reason is an asymmetry: suppressing incorrect answers preserves diversity, while positive-only reinforcement concentrates probability mass and quietly collapses the model's range of valid responses. So negative and positive signals aren't mirror images; they have opposite side effects on a model's distribution.
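To make "higher k" concrete, here is a minimal Python sketch of the standard unbiased Pass@k estimator: the probability that at least one of k samples, drawn without replacement from n generated samples of which c are correct, is correct. The two toy "models" and their counts are invented for illustration and are not results from the study; they only show how a policy that concentrates its mass can win at k=1 yet lose at larger k.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average Pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts of correct samples out of N = 10 on two problems.
N = 10
collapsed = [10, 0]  # always right on one problem, never right on the other
diverse = [5, 2]     # less reliable per sample, but covers both problems

for k in (1, 5):
    print(
        f"k={k}",
        "collapsed:", round(mean_pass_at_k(collapsed, N, k), 3),
        "diverse:", round(mean_pass_at_k(diverse, N, k), 3),
    )
```

In this made-up case the collapsed policy leads at k=1 and the diverse policy leads at k=5, which is the shape of trade-off the paragraph above describes.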

That asymmetry shows up again from a different angle. RL post-training tends to amplify a single dominant output format from pretraining and collapse the alternatives within the first epoch, and the winning format is not necessarily the best-performing one (see "Does RL training collapse format diversity in pretrained models?"). Positive reward is a funnel; it narrows. If your real goal is a *constraint* (keep options open, don't lose the long tail), then reward-shaping toward a target works against you, and suppression-style signals are the better fit. This is the deeper version of the question: positive signals sharpen toward one answer, while negative signals carve away bad regions and leave the rest largely intact.
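The sharpen-versus-carve contrast can be seen in a toy softmax policy. The sketch below applies one REINFORCE-style update to the logits of a four-answer categorical policy: once with a positive sample (reward +1 on a sampled correct answer) and once with a negative sample (reward -1 on a sampled wrong answer). The four-answer setup, the learning rate, and the uniform starting point are assumptions chosen for illustration; this is not the training procedure from either study.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def reinforce_step(logits, action, reward, lr=1.0):
    """One per-sample policy-gradient step on a softmax policy.

    The gradient of log pi(action) with respect to the logits is
    onehot(action) - probs, scaled here by the reward.
    """
    probs = softmax(logits)
    return [
        x + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

# Assumed toy setup: answers 0 and 1 are correct, answers 2 and 3 are wrong.
start = [0.0, 0.0, 0.0, 0.0]
after_positive = softmax(reinforce_step(start, action=0, reward=+1.0))
after_negative = softmax(reinforce_step(start, action=2, reward=-1.0))

print("positive:", [round(p, 3) for p in after_positive],
      "entropy", round(entropy(after_positive), 3))
print("negative:", [round(p, 3) for p in after_negative],
      "entropy", round(entropy(after_negative), 3))
```

With the positive sample, the other correct answer (index 1) loses probability along with the wrong ones. With the negative sample, only the penalized answer loses mass and the remaining answers all gain, so entropy stays higher. This is a single-sample picture only and says nothing about how the effect accumulates over full training.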

But here's the unsettling complication: models may not actually be *learning the constraint* at all. When researchers removed constraints from problems, twelve of fourteen models performed *worse*, by up to 38.5 percentage points: they had been defaulting to the harder, more conservative option and only appearing to reason about the constraint (see "Are models actually reasoning about constraints or just defaulting conservatively?"). So a 'negative constraint' can be satisfied by a cheap heuristic (always pick the safe option) rather than genuine constraint evaluation. That means the training-signal question has a trap inside it: you can reward apparent constraint-following and teach a shortcut instead of the constraint.
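One way to probe for this shortcut is to score comparable items with and without the constraint and compare accuracy. The sketch below is a hypothetical harness, not the cited study's protocol: the `Item` fields, the `always_conservative` stand-in model, and the four hand-written items are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Item:
    prompt: str
    constrained: bool  # True if the constraint is present in the prompt
    gold: str          # correct option for this item: "conservative" or "easy"

def accuracy(model: Callable[[str], str], items: List[Item]) -> float:
    return sum(model(it.prompt) == it.gold for it in items) / len(items)

def constraint_removal_gap(model: Callable[[str], str], items: List[Item]) -> float:
    """Accuracy with constraints minus accuracy without, in percentage points."""
    with_c = [it for it in items if it.constrained]
    without_c = [it for it in items if not it.constrained]
    return 100.0 * (accuracy(model, with_c) - accuracy(model, without_c))

def always_conservative(prompt: str) -> str:
    """Stand-in 'model' that ignores the prompt and picks the safe option."""
    return "conservative"

# Hand-written toy items: the constraint makes the conservative option correct.
items = [
    Item("The short route is closed. Which route?", True, "conservative"),
    Item("Bridge limit is 5t and the truck weighs 8t. Which route?", True, "conservative"),
    Item("Which route?", False, "easy"),
    Item("The truck weighs 8t. Which route?", False, "easy"),
]

gap = constraint_removal_gap(always_conservative, items)
print(f"accuracy drop when constraints are removed: {gap:.1f} points")
```

A large positive gap for a model that looks competent on the constrained items suggests it is defaulting to the safe option rather than evaluating the constraint.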

This connects to a broader corpus theme: instruction signals often don't teach what we think they teach. Instruction tuning largely transfers knowledge of the *output format space*, not task understanding; models trained on semantically empty or deliberately wrong instructions perform about as well as those trained on correct ones (see "Does instruction tuning teach task understanding or output format?"). And as training improves reasoning, instruction adherence actually drops, because longer chains of thought create contextual distance from the original instruction (see "Why do better reasoning models ignore instructions?"). Both findings imply that *positive* instruction-following is already a shallow signal, so it's no surprise that constraints need something sturdier than 'add it to the prompt and reward compliance.'

The most promising bridge in the corpus is decomposition: breaking subjective instruction-following into verifiable sub-criteria (checklists) gives reinforcement learning a signal it can actually grade, and reduces overfitting to superficial artifacts (see "Can breaking down instructions into checklists improve AI reward signals?"). A negative constraint becomes trainable precisely when you can *verify* the violation rather than reward a vibe. Put together, the corpus's answer is: yes, negative constraints want a different signal, one built on suppression and verification rather than reward-toward-target. The failure mode isn't that constraints are hard to optimize; it's that they're easy to fake.
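As a rough illustration of the decomposition idea, the sketch below scores a response against a checklist of verifiable sub-criteria, with negative constraints written as explicit violation checks. The `Criterion` type, the example checks, and the equal weighting are assumptions made for this sketch; it is not the RLCF or RaR implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # returns True when the criterion is satisfied

def checklist_reward(response: str, checklist: List[Criterion]) -> float:
    """Fraction of checklist items the response satisfies (equal weights assumed)."""
    passed = sum(1 for c in checklist if c.check(response))
    return passed / len(checklist)

checklist = [
    # Positive instruction: something must be present.
    Criterion("includes a four-digit year",
              lambda r: bool(re.search(r"\b\d{4}\b", r))),
    # Negative constraints: a specific, checkable violation must be absent.
    Criterion("no first-person pronoun",
              lambda r: not re.search(r"\b(I|me|my)\b", r)),
    Criterion("at most 20 words",
              lambda r: len(r.split()) <= 20),
]

good = "The library was released in 2019 and is maintained by volunteers."
bad = "I think it came out a while ago."

print(checklist_reward(good, checklist))
print(checklist_reward(bad, checklist))
```

Each negative constraint here is graded by detecting a concrete violation, so the reward cannot be earned by merely sounding compliant. Real checklists would need checks that are harder to game than these regular expressions.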


Sources (6 notes)

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% vs. a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
