Reinforcement Learning for LLMs

Why does self-rewarding training collapse when responses improve?

Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?

Note · 2026-02-22 · sourced from Reward Models

Self-Rewarding Language Models (Yuan et al., 2024) merge the generator and the evaluator into a single model. The model generates candidate responses, evaluates them via LLM-as-a-Judge prompting, selects preference pairs, and trains via DPO. Each iteration improves both instruction following and reward quality. This co-evolution sidesteps the frozen-reward-model bottleneck — the evaluator grows alongside the generator.
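The loop can be sketched in a few lines of toy Python (a sketch of the iteration structure only; `generate` and `judge_score` are hypothetical stubs standing in for real sampling and LLM-as-a-Judge prompting, not the paper's code):

```python
import random

# Toy sketch of one Self-Rewarding iteration (Yuan et al., 2024).
# `generate` and `judge_score` are stand-ins for real model sampling
# and LLM-as-a-Judge prompting; here they are deterministic stubs.

def generate(model, prompt, n=4):
    """Sample n candidates; each is (text, hidden quality in [0, 1])."""
    rng = random.Random(f"{model}:{prompt}")
    return [(f"response-{i}", rng.random()) for i in range(n)]

def judge_score(model, prompt, response):
    """LLM-as-a-Judge scoring (stub: the judge reads the hidden quality)."""
    _text, quality = response
    return quality

def build_preference_pairs(model, prompts):
    """Highest-scored candidate becomes `chosen`, lowest `rejected`."""
    pairs = []
    for p in prompts:
        ranked = sorted(generate(model, p),
                        key=lambda r: judge_score(model, p, r))
        pairs.append((p, ranked[-1], ranked[0]))  # (prompt, chosen, rejected)
    return pairs

pairs = build_preference_pairs("M1", ["prompt-a", "prompt-b"])
# Each pair then feeds a DPO update; the updated model plays both
# generator and judge in the next iteration.
```

The key structural point is that the same `model` argument appears in both `generate` and `judge_score`: generator and evaluator improve together after every DPO step.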

But the approach hits a wall. The Temporal Self-Rewarding paper (2025) identifies the mechanism: as both chosen and rejected responses improve across iterations, their representations converge, and the reward-score gap between "best" and "worst" responses narrows by roughly 9x. Once chosen and rejected become representationally similar, the DPO gradient vanishes: the model can no longer learn because it can no longer distinguish good from bad.
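A toy numeric sketch of why the gradient dies (the vectors and margins below are made up for illustration; only the rough 9x score-gap figure comes from the paper): the per-example DPO update is β·σ(−β·margin) times the difference of the chosen and rejected log-prob gradients, and as responses converge the gradient-difference term shrinks toward zero.

```python
import math

# Toy illustration of DPO gradient collapse. The update direction is
#   beta * sigmoid(-beta * margin) * (grad_logp_chosen - grad_logp_rejected)
# When chosen and rejected responses converge, the gradient-difference
# term (and the margin) shrink, so the update norm collapses.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_update_norm(margin, grad_chosen, grad_rejected, beta=0.1):
    weight = beta * sigmoid(-beta * margin)
    diff = [c - r for c, r in zip(grad_chosen, grad_rejected)]
    return weight * math.sqrt(sum(d * d for d in diff))

# Early iteration: responses far apart in quality and representation.
early = dpo_update_norm(margin=9.0, grad_chosen=[1.0, 0.0],
                        grad_rejected=[-1.0, 0.0])
# Late iteration: score gap ~9x smaller, gradients nearly identical.
late = dpo_update_norm(margin=1.0, grad_chosen=[1.0, 0.0],
                       grad_rejected=[0.8, 0.0])
# early / late is about 6 here: the per-example update is several
# times weaker once chosen and rejected have converged.
```

Note that the collapse is driven mainly by the `(grad_chosen - grad_rejected)` term: even a neutral sigmoid weight cannot rescue an update whose direction vector has shrunk to near zero.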

This is a different failure mode from Does policy entropy collapse limit reasoning performance in RL? (which is about narrowing action diversity) and from How quickly do errors compound during model self-training? (which is about accumulating errors). Here, the model is actually improving — but the improvement itself destroys the preference learning signal.

The fix is temporal decoupling: (1) Anchored Rejection: draw rejected responses from the initial SFT model (the past generation), preventing quality inflation in negative samples. (2) Future-Guided Chosen: select chosen responses using a temporarily trained next-generation model, accessing superior outputs the current model cannot yet produce. Decoupling chosen and rejected across model generations maintains the representational gap without additional compute; the method uses half the training iterations of standard Self-Rewarding.
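The pair construction can be sketched as follows (assumed from the paper's description; all function names and the toy quality model are illustrative, not the authors' implementation):

```python
import random

# Sketch of temporal-decoupling pair construction. Each "model" here is
# just a label with a hypothetical quality level; real models would
# sample text and score it via LLM-as-a-Judge prompting.

QUALITY = {"sft": 0.2, "current": 0.5, "next": 0.8}  # hypothetical levels

def generate(model, prompt, n=4):
    """Stub sampler: responses scatter around the model's quality level."""
    rng = random.Random(prompt)
    return [QUALITY[model] + rng.uniform(-0.1, 0.1) for _ in range(n)]

def judge_score(model, prompt, response):
    """Stub judge: a response's score is just its quality value."""
    return response

def build_temporal_pairs(prompts, sft_model, current_model, next_model):
    """Anchored Rejection + Future-Guided Chosen.

    `rejected` comes from the frozen initial SFT model (past anchor), so
    negatives never inflate in quality across iterations; `chosen` comes
    from a temporarily trained next-generation model, a target the
    current model cannot yet produce. The chosen-rejected gap stays wide.
    """
    pairs = []
    for p in prompts:
        rejected = min(generate(sft_model, p),
                       key=lambda r: judge_score(current_model, p, r))
        chosen = max(generate(next_model, p),
                     key=lambda r: judge_score(current_model, p, r))
        pairs.append((p, chosen, rejected))
    return pairs

pairs = build_temporal_pairs(["prompt-a"], "sft", "current", "next")
```

Contrast with standard Self-Rewarding, where both ends of the pair come from the same current model: here the two ends are pinned to different model generations, so improving the current model cannot close the gap.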

The broader implication: any iterative self-improvement loop where the same model evaluates and generates will eventually converge unless the evaluation signal is anchored to an external reference point — whether that's a past model, a future model, or an external critic.

Complementary fix — Meta-Rewarding: Why do self-improvement loops eventually stop improving? addresses the same saturation problem from a different angle. While temporal anchoring fixes the preference signal (maintaining the chosen-rejected gap), Meta-Rewarding fixes the evaluator quality by adding a meta-judge that evaluates the judge's judgments. The two solutions are complementary: temporal anchoring prevents gradient collapse; meta-judging prevents evaluation stagnation. A system could use both — meta-judging to improve judge accuracy, temporal anchoring to maintain preference signal strength.


Source: Reward Models — Self-Rewarding Language Models (arxiv 2401.10020), Temporal Self-Rewarding Language Models (arxiv 2508.06026)


self-rewarding iterative training creates a co-evolution loop but suffers gradient collapse when chosen-rejected responses converge — temporal anchoring to past and future models maintains the learning signal