Reinforcement Learning for LLMs

Why does self-rewarding training collapse when responses improve?

Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?

Note · 2026-02-22 · sourced from Reward Models

Self-Rewarding Language Models (Yuan et al., 2024) merge the generator and the evaluator into a single model. The model generates candidate responses, evaluates them via LLM-as-a-Judge prompting, selects preference pairs, and trains via DPO. Each iteration improves both instruction following and reward quality. This co-evolution sidesteps the frozen-reward-model bottleneck — the evaluator grows alongside the generator.
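The loop can be sketched in a few lines of toy Python (a sketch of the iteration structure only; `generate` and `judge_score` are hypothetical stubs standing in for real sampling and LLM-as-a-Judge prompting, not the paper's code):

```python
import random

# Toy sketch of one Self-Rewarding iteration (Yuan et al., 2024).
# `generate` and `judge_score` are stand-ins for real model sampling
# and LLM-as-a-Judge prompting; here they are deterministic stubs.

def generate(model, prompt, n=4):
    """Sample n candidates; each is (text, hidden quality in [0, 1])."""
    rng = random.Random(f"{model}:{prompt}")
    return [(f"response-{i}", rng.random()) for i in range(n)]

def judge_score(model, prompt, response):
    """LLM-as-a-Judge scoring (stub: the judge reads the hidden quality)."""
    _text, quality = response
    return quality

def build_preference_pairs(model, prompts):
    """Highest-scored candidate becomes `chosen`, lowest `rejected`."""
    pairs = []
    for p in prompts:
        ranked = sorted(generate(model, p),
                        key=lambda r: judge_score(model, p, r))
        pairs.append((p, ranked[-1], ranked[0]))  # (prompt, chosen, rejected)
    return pairs

pairs = build_preference_pairs("M1", ["prompt-a", "prompt-b"])
# Each pair then feeds a DPO update; the updated model plays both
# generator and judge in the next iteration.
```

The key structural point is that the same `model` argument appears in both `generate` and `judge_score`: generator and evaluator improve together after every DPO step.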

But the approach hits a wall. The Temporal Self-Rewarding paper (2025) identifies the mechanism: as both chosen and rejected responses improve across iterations, their representations converge, and the reward-score gap between "best" and "worst" responses narrows by roughly 9x. Once chosen and rejected become representationally similar, the DPO gradient vanishes: the model can no longer learn because it can no longer distinguish good from bad.
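A toy numeric sketch of why the gradient dies (the vectors and margins below are made up for illustration; only the rough 9x score-gap figure comes from the paper): the per-example DPO update is β·σ(−β·margin) times the difference of the chosen and rejected log-prob gradients, and as responses converge the gradient-difference term shrinks toward zero.

```python
import math

# Toy illustration of DPO gradient collapse. The update direction is
#   beta * sigmoid(-beta * margin) * (grad_logp_chosen - grad_logp_rejected)
# When chosen and rejected responses converge, the gradient-difference
# term (and the margin) shrink, so the update norm collapses.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_update_norm(margin, grad_chosen, grad_rejected, beta=0.1):
    weight = beta * sigmoid(-beta * margin)
    diff = [c - r for c, r in zip(grad_chosen, grad_rejected)]
    return weight * math.sqrt(sum(d * d for d in diff))

# Early iteration: responses far apart in quality and representation.
early = dpo_update_norm(margin=9.0, grad_chosen=[1.0, 0.0],
                        grad_rejected=[-1.0, 0.0])
# Late iteration: score gap ~9x smaller, gradients nearly identical.
late = dpo_update_norm(margin=1.0, grad_chosen=[1.0, 0.0],
                       grad_rejected=[0.8, 0.0])
# early / late is about 6 here: the per-example update is several
# times weaker once chosen and rejected have converged.
```

Note that the collapse is driven mainly by the `(grad_chosen - grad_rejected)` term: even a neutral sigmoid weight cannot rescue an update whose direction vector has shrunk to near zero.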

This is a different failure mode from Does policy entropy collapse limit reasoning performance in RL? (which is about narrowing action diversity) and from How quickly do errors compound during model self-training? (which is about accumulating errors). Here, the model is actually improving — but the improvement itself destroys the preference learning signal.

The fix is temporal decoupling: (1) Anchored Rejection: draw rejected responses from the initial SFT model (the past generation), preventing quality inflation in negative samples. (2) Future-Guided Chosen: select chosen responses using a temporarily trained next-generation model, accessing superior outputs the current model cannot yet produce. Decoupling chosen and rejected across model generations maintains the representational gap without additional compute; the method uses half the training iterations of standard Self-Rewarding.
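The pair construction can be sketched as follows (assumed from the paper's description; all function names and the toy quality model are illustrative, not the authors' implementation):

```python
import random

# Sketch of temporal-decoupling pair construction. Each "model" here is
# just a label with a hypothetical quality level; real models would
# sample text and score it via LLM-as-a-Judge prompting.

QUALITY = {"sft": 0.2, "current": 0.5, "next": 0.8}  # hypothetical levels

def generate(model, prompt, n=4):
    """Stub sampler: responses scatter around the model's quality level."""
    rng = random.Random(prompt)
    return [QUALITY[model] + rng.uniform(-0.1, 0.1) for _ in range(n)]

def judge_score(model, prompt, response):
    """Stub judge: a response's score is just its quality value."""
    return response

def build_temporal_pairs(prompts, sft_model, current_model, next_model):
    """Anchored Rejection + Future-Guided Chosen.

    `rejected` comes from the frozen initial SFT model (past anchor), so
    negatives never inflate in quality across iterations; `chosen` comes
    from a temporarily trained next-generation model, a target the
    current model cannot yet produce. The chosen-rejected gap stays wide.
    """
    pairs = []
    for p in prompts:
        rejected = min(generate(sft_model, p),
                       key=lambda r: judge_score(current_model, p, r))
        chosen = max(generate(next_model, p),
                     key=lambda r: judge_score(current_model, p, r))
        pairs.append((p, chosen, rejected))
    return pairs

pairs = build_temporal_pairs(["prompt-a"], "sft", "current", "next")
```

Contrast with standard Self-Rewarding, where both ends of the pair come from the same current model: here the two ends are pinned to different model generations, so improving the current model cannot close the gap.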

The broader implication: any iterative self-improvement loop where the same model evaluates and generates will eventually converge unless the evaluation signal is anchored to an external reference point — whether that's a past model, a future model, or an external critic.

Complementary fix — Meta-Rewarding: Why do self-improvement loops eventually stop improving? addresses the same saturation problem from a different angle. While temporal anchoring fixes the preference signal (maintaining the chosen-rejected gap), Meta-Rewarding fixes the evaluator quality by adding a meta-judge that evaluates the judge's judgments. The two solutions are complementary: temporal anchoring prevents gradient collapse; meta-judging prevents evaluation stagnation. A system could use both — meta-judging to improve judge accuracy, temporal anchoring to maintain preference signal strength.


Source: Reward Models — Self-Rewarding Language Models (arxiv 2401.10020), Temporal Self-Rewarding Language Models (arxiv 2508.06026)


self-rewarding iterative training creates a co-evolution loop but suffers gradient collapse when chosen-rejected responses converge — temporal anchoring to past and future models maintains the learning signal