INQUIRING LINE

When does outcome reward signal become informative during model training?

This explores the timing question hiding inside outcome-only reward (RLVR, reinforcement learning with verifiable rewards): when a single end-of-trajectory signal, right or wrong, actually carries usable learning information, versus when it is too sparse, too late, or too blunt to teach anything.


The corpus's sharpest answer is uncomfortable: outcome reward is often informative not because it teaches, but because it *activates* what the base model already knows. The note "What does reward learning actually do to model reasoning?" finds that RLVR improves how efficiently a model samples within its existing capability, and that a single training example can suffice to trigger the improvement. "Does RLVR actually expand what models can reason about?" sharpens the point: at high sampling budgets (pass@k with large k) the base model beats the RLVR-trained one, so the outcome signal narrows the distribution toward solutions already present rather than expanding the frontier. The signal is "informative" mostly in the sense of a spotlight, not a teacher.

That reframes *when* into a question about the model's pretraining. "Why do random rewards improve reasoning for some models but not others?" is the cleanest demonstration: random or even incorrect rewards lift Qwen2.5-Math by 16–25% on MATH-500 but do nothing for Llama or OLMo. The outcome signal becomes informative only when pretraining has left a latent strategy for it to surface, so the reward's value is conditional on the substrate rather than intrinsic to the reward. Pour outcome reward into a model with no relevant latent behavior and you get nothing back.

The other half of the answer concerns *granularity over the trajectory.* Outcome reward goes dark exactly when you need it most: on hard problems where every rollout fails, there is no positive signal at all. "Can step-wise expert rewards help small models learn hard reasoning?" addresses this directly. It argues for step-wise expert-similarity rewards as a curriculum *foundation* and explicitly positions outcome-based refinement as the stage that comes *after*. On this view outcome reward is informative late, once a model can already produce partially correct trajectories worth distinguishing. "Can an agent's own beliefs guide credit assignment without critics?" and "Can model confidence work as a reward signal for reasoning?" take the complementary route: manufacture a dense, per-turn signal from the model's own shifting beliefs or confidence, so you are not waiting for the outcome at all.
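A minimal sketch of why the signal goes dark, assuming a group-normalized advantage of the kind used in GRPO-style training (the notes name GRPO only as a baseline, so treating it as the optimizer here is an assumption). The function name and reward lists are illustrative, not taken from any of the cited work.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Center and scale outcome rewards within one group of rollouts.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard problem: every rollout fails, so every reward is 0.
all_fail = group_advantages([0, 0, 0, 0])

# Slightly easier problem: one rollout in four succeeds.
mixed = group_advantages([0, 0, 1, 0])

print(all_fail)  # every advantage is zero, so there is no gradient signal
print(mixed)     # the successful rollout gets a positive advantage
```

Under this assumed normalization, a group in which every rollout succeeds is just as uninformative as one in which every rollout fails: the advantages only become non-zero once outcomes differ within the group.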

There is also a subtler "when": which direction of the signal matters. "Does negative reinforcement alone outperform full reinforcement learning?" finds that the *negative* half of outcome reward, which suppresses wrong trajectories while preserving diversity, often matches full RL, whereas positive-only reinforcement degrades higher-k performance by over-concentrating probability mass. So outcome reward can be informative and harmful at once, depending on which side you lean on. "Does binary reward training hurt model calibration?" adds the cost: a bare binary outcome reward teaches confident guessing because it never penalizes a confident wrong answer. It is informative for accuracy and corrosive for calibration until you add a proper scoring term such as the Brier score.

The thread that ties these together, if you want to keep pulling, is that the field is quietly routing *around* the late, sparse outcome signal. "Can reward models benefit from reasoning before scoring?" and "Can language models replace reward models with internal signals?" show reward itself being made earlier and richer: reasoning before scoring, belief shifts replacing critics, self-distillation replacing the explicit reward signal. The honest takeaway: outcome reward is most informative when the base model is already capable and you only need to re-weight what it knows. Whenever you are trying to teach something genuinely new, the answer to "when does it become informative?" tends to be "too late, and not enough", which is exactly why so much of the corpus is inventing denser substitutes.


Sources (10 notes)

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
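The crossover described above can be illustrated with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the underlying analysis used exactly this estimator is an assumption, and the per-problem counts below are invented purely to show the shape of the effect. They are not measurements.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 100
# HYPOTHETICAL correct-sample counts on four problems (out of N samples each).
base = [5, 5, 5, 5]    # solves every problem, but rarely
rlvr = [60, 60, 0, 0]  # solves two problems reliably, two never

for k in (1, 8, 64):
    print(k, round(mean_pass_at_k(base, N, k), 3), round(mean_pass_at_k(rlvr, N, k), 3))
```

With these made-up counts the narrowed model wins at k = 1 and the broader model wins at k = 64, which is the pattern the note attributes to RLVR: probability mass concentrates on problems already solvable, and coverage of the rest does not grow.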

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16–25% on MATH-500 from random or incorrect rewards, which activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
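One way to read "log-ratios of sequential probability estimates" is sketched below. It assumes the agent reports, after each turn, a probability for the correct answer, and that per-turn credit is the log of the ratio between consecutive estimates. The function name, the belief values, and this exact form are assumptions for illustration, not the method's published definition.

```python
import math

def delta_belief_credits(beliefs):
    """Per-turn credit as the log-ratio of consecutive belief estimates.

    beliefs[0] is the prior; beliefs[t] is the (hypothetical) probability
    assigned to the correct answer after turn t.
    """
    if any(not 0.0 < p <= 1.0 for p in beliefs):
        raise ValueError("beliefs must lie in (0, 1]")
    return [math.log(cur / prev) for prev, cur in zip(beliefs, beliefs[1:])]

# HYPOTHETICAL belief trajectory over a four-turn game.
beliefs = [0.01, 0.02, 0.02, 0.40, 0.90]
credits = delta_belief_credits(beliefs)

for turn, credit in enumerate(credits, start=1):
    print(turn, round(credit, 3))
```

In this reading, a turn that leaves the belief unchanged earns zero credit, and the credits telescope: their sum equals the log-ratio of the final belief to the prior. The total still reflects the outcome, but it is distributed across turns with no separate critic.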

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
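A small sketch of the idea, assuming the combined reward is correctness minus the squared error between stated confidence and correctness (the Brier penalty). The exact weighting used in the underlying work is not given here, so this form and the function names are assumptions.

```python
def binary_reward(correct):
    """Outcome-only reward: ignores stated confidence entirely."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct, confidence):
    """Correctness minus the Brier penalty on the stated confidence."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def expected_calibrated_reward(p_correct, confidence):
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))

# A wrong answer costs nothing extra under the binary reward, however
# confident it was; under the assumed combined reward the penalty grows
# with the stated confidence.
print(binary_reward(False))
print(calibrated_reward(False, 0.5), calibrated_reward(False, 0.99))

# For a model that is right 30% of the time, search a grid of stated
# confidences for the one with the highest expected combined reward.
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda q: expected_calibrated_reward(0.3, q))
print(best)
```

Because the Brier score is a proper scoring rule, the expected combined reward in this sketch peaks when the stated confidence equals the true chance of being right, so overstating confidence no longer pays.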

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.
