What makes reward signal sources substitutable across verifier-free RL patterns?
This explores why late-2025 RL work keeps finding it can swap out the reward source — a trained reward model, a critic, an explicit scalar — and still get comparable results, and what underlying property makes those sources interchangeable.
The cleanest map of the territory is the observation that verifier-free RL has independently converged on three substitutable patterns: pairwise self-judgment standing in for the reward model, internal belief-shift standing in for the critic, and rich-feedback self-distillation standing in for the explicit reward signal Can language models replace reward models with internal signals?. The thread connecting all three is that the signal is generated from the policy's own computations rather than from an external trained classifier. Once the signal lives inside the model, the particular external apparatus you bolt on becomes optional, and that is the structural reason substitution is possible at all.
But there's a deeper, more surprising reason the sources are interchangeable: in much of this work the reward isn't teaching the model anything new. RLVR mainly sharpens sampling toward solutions already present in the base model's distribution rather than expanding what the model can solve Does RLVR actually expand what models can reason about?. The striking corollary is that spurious rewards work nearly as well as correct ones for a model with the right pretraining, and a single example can suffice to trigger the behavior What does reward learning actually do to model reasoning?. If the reward's job is to activate a latent strategy rather than to convey ground truth, then the information content of the signal is low — and a low-information signal is exactly the kind of thing you can substitute freely. The fungibility is downstream of the fact that the signal is doing activation, not instruction.
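The sharpening-versus-expanding distinction can be made concrete with a toy pass@k calculation. This is a minimal sketch, not a reproduction of any reported experiment: the per-problem success probabilities are hypothetical numbers chosen for illustration, and samples are assumed to be independent.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct,
    given a per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(per_problem_p: list[float], k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in per_problem_p) / len(per_problem_p)


# Hypothetical per-problem success probabilities (illustrative, not measured).
base = [0.05, 0.05]       # unreliable, but some mass on both solutions
sharpened = [0.90, 0.00]  # reliable on one problem, no mass left on the other

for k in (1, 256):
    print(k, round(mean_pass_at_k(base, k), 3), round(mean_pass_at_k(sharpened, k), 3))
```

With these made-up numbers the sharpened policy wins at k=1 (0.45 against 0.05) but plateaus at 0.5, while the base policy approaches 1.0 at k=256. That is the pattern one would expect if training concentrates probability on solutions the base model could already reach rather than adding new ones.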
The same logic shows up from the opposite direction. Training on only negative samples, which suppresses wrong trajectories without ever reinforcing right ones, often matches or beats full PPO and GRPO: suppressing errors leaves the rest of the distribution intact, whereas positive-only reinforcement collapses diversity by concentrating probability mass Does negative reinforcement alone outperform full reinforcement learning?. And the belief-shift pattern shows a model's own log-ratio of confidence in the target solution can serve as a dense per-turn reward with no critic network at all Can an agent's own beliefs guide credit assignment without critics?. The sources differ but the outcomes are comparable, because each is a different route to the same internal nudge.
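The belief-shift idea can be sketched as a per-turn log-ratio of the probability the policy assigns to the target solution. This is a minimal illustration under stated assumptions: the function name `belief_shift_rewards`, the `eps` clamp, and the example probabilities are all hypothetical, and how the probabilities are elicited from the model is left out.

```python
import math


def belief_shift_rewards(target_probs: list[float], eps: float = 1e-12) -> list[float]:
    """Per-turn reward = log p_t(target) - log p_{t-1}(target).

    target_probs[t] is the policy's own probability estimate for the target
    after turn t (index 0 is the estimate before any turn). The eps clamp is
    an assumption added here to avoid log(0).
    """
    rewards = []
    for prev, curr in zip(target_probs, target_probs[1:]):
        rewards.append(math.log(max(curr, eps)) - math.log(max(prev, eps)))
    return rewards


# Hypothetical belief trajectory over four turns (illustrative values only).
probs = [0.01, 0.02, 0.02, 0.10, 0.60]
rewards = belief_shift_rewards(probs)
print(rewards)
print(sum(rewards), math.log(probs[-1] / probs[0]))
```

A turn that leaves the belief unchanged earns zero, and the per-turn rewards telescope to the total log-ratio between the final and initial belief. Credit is therefore spread across turns without a separate critic having to estimate it.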
The limits of substitutability are just as instructive, and this is the part a curious reader might not expect. Sources stop being interchangeable the moment the reward has to carry information the policy can't supply itself. Natural feedback decomposes into evaluative information (how good was that?) and directive information (how should it change?), and a scalar reward can capture the first but throws away the second — which is why token-level distillation recovers something scalar rewards structurally cannot Can scalar rewards capture all the information in agent feedback?. The same boundary appears in what the signal's *shape* encodes: binary correctness rewards quietly degrade calibration by never penalizing confident wrong answers, and you can only fix that by adding a genuinely different term like a Brier score Does binary reward training hurt model calibration?; ternary rewards make abstention learnable in a way binary ones can't Can three-way rewards fix the accuracy versus abstention problem?; and decomposing instructions into checklist sub-criteria captures subjective quality a holistic reward misses Can breaking down instructions into checklists improve AI reward signals?.
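The difference in what each reward shape can express is easy to see side by side. The sketch below is illustrative only: the exact functional forms are assumptions (correctness minus squared confidence error for the Brier-augmented variant, and an abstention reward of 0.0 as the intermediate value), and the function names are hypothetical rather than taken from any of the cited methods.

```python
def binary_reward(correct: bool) -> float:
    """1 for a correct answer, 0 otherwise; confidence is invisible to it."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    Assumed form: a confident wrong answer is punished more than a hesitant one.
    """
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def ternary_reward(outcome: str, abstain_reward: float = 0.0) -> float:
    """+1 correct, -1 wrong, and an intermediate value for abstaining.

    abstain_reward=0.0 is an assumed default; any value strictly between
    -1 and +1 keeps abstention preferable to a wrong answer.
    """
    table = {"correct": 1.0, "abstain": abstain_reward, "wrong": -1.0}
    return table[outcome]


# A wrong answer at 99% confidence versus one at 50% confidence.
print(binary_reward(False), binary_reward(False))
print(brier_augmented_reward(False, 0.99), brier_augmented_reward(False, 0.50))
print([ternary_reward(o) for o in ("wrong", "abstain", "correct")])
```

The binary reward returns the same value for both wrong answers, so it gives no gradient against overconfidence. The Brier-augmented and ternary variants each add a distinction the binary signal cannot represent, which is why they are not substitutes for it.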
So the answer is two-sided. Reward sources are substitutable when they're all just activating capabilities the base model already holds — the signal is a pointer, and pointers are cheap and interchangeable. They become non-substitutable the moment the reward has to add structure the model lacks on its own: directional feedback, calibration, abstention, fine-grained quality. The interesting takeaway is that 'which reward source' is the wrong question until you've asked 'is this reward teaching or merely activating?' — and most verifier-free RL turns out to be activating, which is precisely why it doesn't matter much where the signal comes from.
Sources (9 notes)
Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.