Does negative reinforcement alone outperform full reinforcement learning?
Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.
Decomposing RL's learning signal into positive sample reinforcement (PSR) and negative sample reinforcement (NSR) reveals a surprising asymmetry. Training with only negative samples — penalizing incorrect responses without ever reinforcing correct ones — consistently improves performance over the base model across the entire Pass@k spectrum (k up to 256), often matching or surpassing full PPO and GRPO.
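To make the decomposition concrete, here is a minimal sketch, assuming a REINFORCE-style objective with binary verifier rewards (+1 for correct, -1 for incorrect); the function name and interface are illustrative, not from the source:

```python
import torch

def pg_loss(logprobs: torch.Tensor, correct: torch.Tensor, mode: str = "full") -> torch.Tensor:
    """REINFORCE-style loss with binary rewards, split into its two terms.

    logprobs: (batch,) summed log-probability of each sampled response
    correct:  (batch,) bool, whether a verifier marked the response correct
    mode:     "full" uses both terms; "psr" reinforces correct samples only;
              "nsr" penalizes incorrect samples only
    """
    psr = -(logprobs * correct.float()).mean()    # gradient raises log-prob of correct samples
    nsr = (logprobs * (~correct).float()).mean()  # gradient lowers log-prob of incorrect samples
    if mode == "psr":
        return psr
    if mode == "nsr":
        return nsr
    return psr + nsr
```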
Gradient analysis makes the mechanism clear: NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs. It refines existing knowledge rather than introducing entirely new behaviors. Penalizing a wrong answer doesn't point toward any specific correct answer; it lets the model's own prior determine where the freed probability mass flows.
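The redistribution claim follows from the softmax itself: lowering the logit of one candidate renormalizes the freed probability mass across the remaining candidates in proportion to their current probabilities. A toy demonstration (the four logits are made-up numbers):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])  # model's prior over four candidate answers
prior = torch.softmax(logits, dim=0)

# NSR-style update: push down the logit of a sampled incorrect answer (index 0).
penalized = logits.clone()
penalized[0] -= 1.0
posterior = torch.softmax(penalized, dim=0)

# The freed mass flows to the surviving candidates in proportion to the prior:
# all three ratios are identical (about 1.57 with these numbers).
print(posterior[1:] / prior[1:])
```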
Positive-only reinforcement creates the opposite problem. It improves Pass@1 (the model gets better at its top-ranked answer) but degrades performance at higher k, because it concentrates probability mass on rewarded trajectories and reduces diversity. As the related note "Does policy entropy collapse limit reasoning performance in RL?" argues, entropy collapse is a real bottleneck, and positive reinforcement actively contributes to it while negative reinforcement sidesteps it.
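For reference, Pass@k is conventionally computed with the unbiased estimator from the HumanEval evaluation setup: generate n samples per problem, count the c correct ones, and estimate the probability that a random size-k subset contains at least one success. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples: every size-k subset has a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A policy that concentrates on its top trajectory tends to raise Pass@1 but reduce the diversity that large-k evaluation rewards, which is exactly the tradeoff described above.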
This reframes how we think about RL for reasoning. The conventional framing is that RL rewards correct behavior. But the evidence suggests that penalizing incorrect behavior may contribute more to performance than reinforcing correct behavior — especially when diversity matters. The model already contains good solutions in its prior; it just needs help avoiding the bad ones.
The practical implication is that reward design for reasoning RL may be over-engineered. If suppression alone gets you most of the way, the elaborate reward shaping and process supervision architectures may be solving a problem that's already largely solved by the base model's prior distribution.
Source: Reinforcement Learning
Related concepts in this collection
- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  directly supports: pruning IS negative reinforcement at the reasoning path level
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  extends: positive reinforcement actively causes entropy collapse; negative reinforcement avoids it
- Does reinforcement learning update only a small fraction of parameters?
  Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
  complementary: if RL only touches 5–30% of parameters, negative reinforcement may be the primary mechanism for this sparse selection
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and Return-Conditioned RL show similar performance on reasoning tasks. The question is whether algorithm differences are fundamentally irrelevant, or whether something deeper explains the convergence.
  supports: if negative reinforcement alone suffices, algorithm choice matters even less
Original note title: negative reinforcement alone matches or exceeds full rl by suppressing incorrect trajectories and redistributing probability mass