Reinforcement Learning for LLMs

Does negative reinforcement alone outperform full reinforcement learning?

Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.

Note · 2026-02-22 · sourced from Reinforcement Learning

Decomposing RL's learning signal into positive sample reinforcement (PSR) and negative sample reinforcement (NSR) reveals a surprising asymmetry. Training with only negative samples — penalizing incorrect responses without ever reinforcing correct ones — consistently improves performance over the base model across the entire Pass@k spectrum (k up to 256), often matching or surpassing full PPO and GRPO.
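The decomposition can be sketched in a REINFORCE-style objective with binary correctness rewards. This is a minimal illustration with hypothetical names, not the paper's implementation; real training differentiates per-token log-probs and adds baselines and clipping as in PPO/GRPO:

```python
def psr_nsr_loss(logprobs, rewards, mode="both"):
    """Split the policy-gradient objective into PSR and NSR terms.

    logprobs: log-probability the policy assigns to each sampled response
    rewards:  1 if the response is correct, 0 if incorrect
    Minimizing this loss by gradient descent raises the probability of
    correct samples (PSR) and/or lowers that of incorrect samples (NSR).
    """
    psr = -sum(lp for lp, r in zip(logprobs, rewards) if r == 1)
    nsr = sum(lp for lp, r in zip(logprobs, rewards) if r == 0)
    if mode == "psr":
        return psr        # reinforce correct responses only
    if mode == "nsr":
        return nsr        # penalize incorrect responses only
    return psr + nsr      # full signal
```

The NSR-only result corresponds to training with `mode="nsr"`: the correct responses never appear in the loss at all.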

Gradient analysis makes the mechanism clear: NSR works by suppressing incorrect generations and redistributing their probability mass toward other plausible candidates, guided by the model's prior beliefs. It refines existing knowledge rather than introducing entirely new behaviors, because penalizing a wrong answer doesn't point toward any specific correct answer; it lets the model's own prior determine where the freed probability mass flows.
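A tiny numerical sketch (hypothetical logits over four candidate answers) shows this renormalization effect: pushing down only the wrong answer's logit leaves the relative odds of the remaining candidates untouched, so the freed mass flows back in proportion to the prior:

```python
import math

def softmax(logits):
    m = max(logits)                     # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Prior over 4 candidate answers; index 0 is the wrong answer we penalize.
logits = [2.0, 1.0, 0.5, 0.0]
before = softmax(logits)

# NSR-style update: lower only the wrong answer's logit.
logits[0] -= 2.0
after = softmax(logits)

# Every surviving candidate is scaled by the same factor, so the freed
# mass is distributed in proportion to the prior probabilities.
ratios = [after[i] / before[i] for i in range(1, 4)]
```

Because the other logits are unchanged, `ratios` is constant across candidates: the model's prior, not the penalty, decides who inherits the mass.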

Positive-only reinforcement creates the opposite problem. It improves Pass@1 (the model gets better at its top-ranked answer) but degrades performance at higher k, because it concentrates probability mass on rewarded trajectories and reduces diversity. To the extent that policy entropy collapse limits reasoning performance in RL, positive reinforcement actively contributes to the problem while negative reinforcement sidesteps it.
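For reference, Pass@k numbers like these are conventionally computed with the unbiased estimator introduced for evaluating code models (Chen et al., 2021). A diversity-collapsed policy can keep Pass@1 high while gaining little at large k, because extra samples are near-duplicates:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k responses
    drawn without replacement from n samples (c of them correct)
    is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 256 samples, a policy that solves a problem on even one sample is credited at k = 256, which is why preserving diversity matters at the high-k end of the spectrum.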

This reframes how we think about RL for reasoning. The conventional framing is that RL rewards correct behavior. But the evidence suggests that penalizing incorrect behavior may contribute more to performance than reinforcing correct behavior — especially when diversity matters. The model already contains good solutions in its prior; it just needs help avoiding the bad ones.

The practical implication is that reward design for reasoning RL may be over-engineered. If suppression alone gets you most of the way, the elaborate reward shaping and process supervision architectures may be solving a problem that's already largely solved by the base model's prior distribution.



