Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
Self-Rewarding Language Models propose an architecture in which a Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models, which strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces (1) Anchored Rejection, which fixes rejected responses using the outputs of the past initial model, and (2) Future-Guided Chosen, which dynamically curates chosen samples using next-generation model predictions.
Recent advances in Self-Rewarding (Yuan et al. 2024) language models demonstrate an alternative paradigm for self-improvement, in which language models serve dual roles as both response generators and evaluators (Yuan et al. 2024; Wu et al. 2024). Specifically, the Self-Rewarding paradigm builds upon a Supervised Fine-Tuned (SFT) model through an iterative optimization cycle that (1) generates candidate responses to given prompts, (2) uses the same LLM to evaluate these responses via LLM-as-a-Judge prompting (Zheng et al. 2023; Li et al. 2023a; Wang et al. 2024a), and (3) selects preference pairs from the highest- and lowest-scoring responses for DPO training (Rafailov et al. 2023). Most existing work has focused on enhancing the model's judging capabilities to improve the effectiveness of the Self-Rewarding paradigm. For example, meta-rewarding approaches refine judgment skills through self-evaluation (Wu et al. 2024), while other methods include consistency regularization of reward models (Wang et al. 2024b), self-consistency mechanisms for internal rewards (Zhou et al. 2025), and process-based evaluation for mathematical reasoning (Zhang et al. 2025). Unlike traditional approaches that rely on static reward models or fixed preference datasets, these methods allow generation and evaluation quality to co-evolve continuously.
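For concreteness, the loop below is a minimal sketch of one such iteration. The callables `generate`, `judge_score`, and `dpo_update` are hypothetical stand-ins for response sampling, LLM-as-a-Judge scoring, and DPO training; the sketch illustrates the paradigm rather than any particular implementation.

```python
# Minimal sketch of one Self-Rewarding iteration (illustrative only).
# `generate`, `judge_score`, and `dpo_update` are hypothetical callables for
# response sampling, LLM-as-a-Judge scoring, and DPO training, respectively.
def self_rewarding_iteration(model, prompts, generate, judge_score, dpo_update,
                             num_candidates=4):
    preference_pairs = []
    for prompt in prompts:
        # (1) Generate several candidate responses with the current model.
        candidates = [generate(model, prompt) for _ in range(num_candidates)]
        # (2) Score each candidate with the same model acting as judge.
        scores = [judge_score(model, prompt, response) for response in candidates]
        # (3) Keep the highest- and lowest-scoring responses as chosen/rejected.
        chosen = candidates[scores.index(max(scores))]
        rejected = candidates[scores.index(min(scores))]
        if chosen != rejected:
            preference_pairs.append((prompt, chosen, rejected))
    # Train the next-iteration model with DPO on the harvested preference pairs.
    return dpo_update(model, preference_pairs)
```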
Despite the success of Self-Rewarding language models on benchmarks such as AlpacaEval (Li et al. 2023c) and Arena-Hard (Li et al. 2024), our theoretical analysis reveals a critical limitation: as the representational similarity between chosen and rejected responses increases, the DPO gradient vanishes, causing the training process to collapse. This theoretical prediction is empirically validated by our findings: as quantified in Fig. 1, the representations of chosen and rejected responses in the Self-Rewarding paradigm become progressively more similar, with the score gap between them shrinking ninefold over the same period.
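This vanishing behavior can be read directly off the standard DPO gradient (Rafailov et al. 2023), written here for a prompt $x$ with chosen response $y_w$ and rejected response $y_l$:

\[
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}
= -\beta\, \mathbb{E}_{(x, y_w, y_l)}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\,\big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big)\Big],
\qquad
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.
\]

When $y_w$ and $y_l$ (and their representations) become nearly identical, the difference of log-probability gradients in the second factor approaches zero regardless of the sigmoid weight, so the update, and with it the learning signal, collapses.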
This representational convergence directly leads to diminishing quality differences between generated answers, which in turn weakens or eliminates the learning signal for preference optimization. We attribute this convergence to reduced response diversity after reinforcement learning (Zhang et al. 2024; Kirk et al. 2023), which conflicts with the fundamental assumption of preference learning that effective optimization requires clear quality differences between positive and negative samples (Lanchantin et al. 2025; Razin et al. 2025). The resulting vanishing-gradient problem creates a vicious cycle: decreasing answer distinctness makes it harder to produce high-quality preference data, further exacerbating the deterioration of the learning signal.
To address these issues, we propose Temporal Self-Rewarding Language Models, which strategically coordinate past, present, and future model generations to maintain effective preference learning signals. Our approach consists of two key components: (1) Anchored Rejection, which fixes rejected responses using outputs from the initial SFT model (the past generation) to prevent quality inflation in negative samples, and (2) Future-Guided Chosen, which selects high-quality positive samples by incorporating predictions from a future model version. The future model is obtained by first performing DPO training on the current model using the anchored rejection pairs, creating a temporary model that represents the next generation's capabilities. This future model then helps produce superior responses that would otherwise be unavailable to the current model. By decoupling the chosen and rejected responses through this temporal approach, our method maintains clear differences between good and bad examples during training, as shown in Figure 1. Note that our method consumes the same computational resources as the traditional Self-Rewarding approach because we use half the training iterations throughout the paper.
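For illustration, the sketch below shows one plausible way the two phases can be composed, reusing the hypothetical `generate`, `judge_score`, and `dpo_update` callables from the earlier snippet; it is a simplified reading of the procedure described above, not an exact training pipeline.

```python
# Illustrative sketch of one Temporal Self-Rewarding iteration, under the same
# hypothetical `generate`, `judge_score`, and `dpo_update` primitives as before.
def temporal_self_rewarding_iteration(current_model, sft_model, prompts,
                                      generate, judge_score, dpo_update,
                                      num_candidates=4):
    # Phase 1: Anchored Rejection - rejected responses come from the past
    # (initial SFT) model, so negative-sample quality does not inflate.
    anchored_pairs = []
    for prompt in prompts:
        candidates = [generate(current_model, prompt) for _ in range(num_candidates)]
        scores = [judge_score(current_model, prompt, r) for r in candidates]
        chosen = candidates[scores.index(max(scores))]
        rejected = generate(sft_model, prompt)  # anchored to the past model
        anchored_pairs.append((prompt, chosen, rejected))

    # A temporary "future" model: DPO training on the anchored pairs,
    # approximating the next generation's capabilities.
    future_model = dpo_update(current_model, anchored_pairs)

    # Phase 2: Future-Guided Chosen - draw chosen candidates from the future
    # model as well, while the rejected response stays anchored to the SFT model.
    final_pairs = []
    for prompt, _, rejected in anchored_pairs:
        candidates = ([generate(current_model, prompt) for _ in range(num_candidates)]
                      + [generate(future_model, prompt) for _ in range(num_candidates)])
        scores = [judge_score(current_model, prompt, r) for r in candidates]
        chosen = candidates[scores.index(max(scores))]
        final_pairs.append((prompt, chosen, rejected))

    # Train the next-iteration model on pairs with a widened chosen-rejected gap.
    return dpo_update(current_model, final_pairs)
```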