Does the generation-verification gap define where self-rewarding actually works?
This explores whether self-rewarding training only works when a model can check answers better than it can produce them, and whether that 'generation-verification gap' is the real boundary line. The corpus says: largely yes, and the claim has now been made formal. The cleanest statement is that self-improvement is mathematically bounded by exactly this gap: a model can only lift itself when it verifies solutions better than it generates them (What limits how much models can improve themselves?). The striking corollary is that the gap *predicts the domain*: it scales with model size but collapses to zero on factual tasks, which is why self-improvement helps with reasoning but not with factual recall. So the question isn't whether self-rewarding 'works' in general; it's where the verify-minus-generate margin is positive.
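One way to make the verify-minus-generate margin concrete is sketched below. This is an illustrative proxy, not the formal definition used in the cited note: it assumes each sampled answer comes with a ground-truth correctness flag and a self-assigned verifier score, and compares accuracy after reweighting by that score against raw generation accuracy. The function names and the toy data are hypothetical.

```python
from statistics import mean

def weighted_accuracy(samples):
    """Accuracy after reweighting one problem's samples by verifier score.

    samples: list of (is_correct, verifier_score) pairs.
    Falls back to plain accuracy if every score is zero.
    """
    total = sum(score for _, score in samples)
    if total == 0:
        return mean(float(ok) for ok, _ in samples)
    return sum(score for ok, score in samples if ok) / total

def generation_verification_gap(problems):
    """Mean of (verifier-weighted accuracy - raw generation accuracy)."""
    return mean(
        weighted_accuracy(samples) - mean(float(ok) for ok, _ in samples)
        for samples in problems
    )

# Hypothetical data: the verifier discriminates on the reasoning problem
# but assigns the same score to every answer on the factual one.
reasoning = [[(True, 0.9), (False, 0.1), (False, 0.2), (True, 0.8)]]
factual = [[(True, 0.5), (False, 0.5), (False, 0.5), (False, 0.5)]]

print(generation_verification_gap(reasoning))  # positive margin
print(generation_verification_gap(factual))    # zero margin
```

An uninformative verifier leaves the weighting uniform, so the proxy is zero whatever the raw accuracy is. That is the situation the factual-task claim describes.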
When that margin is absent, self-rewarding doesn't just stall; it actively corrodes. Pure self-improvement hits structural walls (diversity collapse and reward hacking), and the methods that actually deliver turn out to be smuggling in an external anchor: a past checkpoint, a third-party judge, a user correction, or a tool result (Can models reliably improve themselves without external feedback?). You can watch the corrosion happen in close-up: self-consistency works as a label-free reward at first, then models learn to produce confidently wrong but reproducible answers, and the proxy's correlation with truth decays even as the loss curve looks like progress (Does self-consistency reliably reward correct answers during training?). Part of why this is so insidious is a built-in bias: models systematically over-trust answers they generated themselves, because a high-probability output simply *feels* correct during evaluation (Why do models trust their own generated answers?). A verifier that shares the generator's blind spots has no gap to exploit.
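The self-consistency failure mode can be sketched in a few lines. This minimal example assumes a simple majority-vote reward (each sample earns 1.0 if it matches the most common answer), which is one common form of self-consistency reward and not necessarily the exact one studied in the cited note. The answers and the ground truth are made up for illustration.

```python
from collections import Counter

def self_consistency_rewards(answers):
    """Reward 1.0 for samples that agree with the majority answer, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def accuracy(answers, truth):
    """Fraction of samples that match the ground-truth answer."""
    return sum(a == truth for a in answers) / len(answers)

truth = "42"  # hypothetical ground truth, never seen by the reward

# Early in training: the majority answer happens to be correct.
early = ["42", "42", "41", "42", "17"]
# Later: the model has collapsed onto a reproducible wrong answer.
late = ["41", "41", "41", "41", "41"]

for name, answers in [("early", early), ("late", late)]:
    rewards = self_consistency_rewards(answers)
    print(name, sum(rewards) / len(rewards), accuracy(answers, truth))
```

In the late batch the mean reward reaches its maximum while accuracy is zero: the reward only measures agreement, so it cannot distinguish reproducible from correct.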
The more interesting frontier is engineering the gap rather than just respecting it. If self-rewarding lives or dies on verification quality, you win by making the verifier stronger than the generator. Reward-reasoning models do exactly this: letting the evaluator think in a chain of thought before scoring raises its capability ceiling above plain outcome-based judging (Can reward models benefit from reasoning before scoring?). Models can even internalize the verifier, training self-assessment in the unused sequence space after the output so the reward computation costs nothing at inference (Can models learn to evaluate their own work during training?). And the signal doesn't have to be a scalar at all: an agent's own belief-shift toward a solution supplies dense per-turn credit without any critic network (Can an agent's own beliefs guide credit assignment without critics?), while other work argues that feedback splits into evaluative *and* directive components that a single reward number can't jointly carry (Can scalar rewards capture all the information in agent feedback?).
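The belief-shift idea can be illustrated with a small sketch. It assumes the agent reports, after each turn, its probability that a particular candidate is the solution, and it assigns each turn the log-ratio of consecutive estimates as credit. The source summary describes log-ratios of sequential probability estimates; the exact formulation in the underlying method may differ, and the function name and belief values here are hypothetical.

```python
import math

def belief_shift_credits(beliefs):
    """Per-turn credit as the log-ratio of consecutive belief estimates.

    beliefs: probabilities p_0, p_1, ..., p_T assigned to the target
    solution before the first turn and after each subsequent turn.
    Returns T credits, one per turn. All probabilities must be > 0.
    """
    if any(p <= 0.0 for p in beliefs):
        raise ValueError("beliefs must be strictly positive")
    return [math.log(curr / prev) for prev, curr in zip(beliefs, beliefs[1:])]

# Hypothetical trajectory: turn 1 helps, turn 2 misleads, turn 3 nearly solves it.
beliefs = [0.1, 0.2, 0.1, 0.8]
credits = belief_shift_credits(beliefs)
print(credits)

# The credits telescope: their sum equals the log-ratio of final to initial belief.
print(math.isclose(sum(credits), math.log(beliefs[-1] / beliefs[0])))
```

Each turn gets a signed, dense signal (positive when belief in the solution rises, negative when it falls) with no separate critic network involved.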
There's a sharp connection worth pulling out: the RLVR debate is the generation-verification gap wearing a different costume. RLVR with a perfect external verifier still doesn't expand what a model can solve; it just sharpens sampling toward solutions already in the base distribution, and for models with appropriate pretraining even spurious rewards work nearly as well as correct ones (Does RLVR actually expand what models can reason about?; What does reward learning actually do to model reasoning?). That reframes the boundary: a verification signal, even a clean one, mostly *activates* latent ability rather than creating new ability. So the gap defines not just *whether* self-rewarding works but *what kind* of work it does: redistributing probability mass, not extending the frontier.
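The pass@k argument behind this claim is easy to reproduce in miniature. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are invented to illustrate the pattern described above (a sharpened model that wins at k=1 and loses at large k); they are not results from the cited notes.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given n total samples of which c are correct (requires k <= n)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 100  # samples drawn per problem (hypothetical)

# Hypothetical correct-sample counts on four problems.
base = [2, 2, 2, 2]    # base model: rarely right, but on every problem
rlvr = [90, 90, 0, 0]  # sharpened model: reliable on two, never on two

for k in (1, 64):
    print(k, mean_pass_at_k(base, n, k), mean_pass_at_k(rlvr, n, k))
```

At k=1 the sharpened model is far ahead, but at k=64 the base model overtakes it, because the sharpened model's coverage is capped by the problems it never solves. That is the sense in which probability mass is redistributed without extending the frontier.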
One caution the corpus adds, which the question doesn't anticipate: even where a real gap exists, optimizing against your own reward can be socially corrosive. Personalizing reward models removes the averaging effect of an aggregate judge and lets the system learn sycophancy and reinforce echo chambers. The result is a self-rewarding loop that 'works' by the math while drifting toward telling each user what they want to hear (Does personalizing reward models amplify user echo chambers?). The generation-verification gap tells you where self-rewarding is *possible*; it doesn't tell you where it's *safe*.
Sources (11 notes)
Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward's correlation with correctness degrades over training, creating a failure mode that looks like improvement.
LLMs exhibit a structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.