SERL: Self-Examining Reinforcement Learning on Open-Domain Tasks
Reinforcement Learning (RL) has been shown to improve the capabilities of large language models (LLMs). However, applying RL to open-domain tasks faces two key challenges: (1) the inherent subjectivity of these tasks precludes the verifiable rewards required by Reinforcement Learning with Verifiable Rewards (RLVR); (2) Reinforcement Learning from Human Feedback (RLHF) relies on external reward mechanisms such as human annotation or dedicated reward models. To overcome these limitations, we propose Self-Examining Reinforcement Learning (SERL), a novel self-improving framework in which the LLM serves as both Actor and Judge. SERL introduces two synergistic reward mechanisms that require no external signals. On the one hand, to improve the Actor’s capability, we derive rewards from Copeland-style pairwise comparison judgments across a group of generated responses. On the other hand, to enhance the Judge’s reliability, we propose a self-consistency reward that encourages coherent judgments. This refinement strengthens the Judge, which in turn yields a more robust training signal for the Actor. Experiments show that our method outperforms existing self-improvement training methods: SERL improves the LC win rate of Qwen3-8B on AlpacaEval 2.0 from 52.37% to 59.90%.
Unverifiable answers: Open-domain tasks, such as open writing and summarization, typically lack definitive correct answers, which RLVR requires. Ma et al. (2025) address this challenge by training a verification model to evaluate the consistency between policy answers and reference answers, while Zhou et al. (2025) and Yu et al. (2025b) use the policy itself to compute, as the reward, the joint distribution over the reasoning process and the reference answer. These works focus only on objective question-answering tasks such as MMLU (Wang et al. 2024), GPQA (Rein et al. 2024), and TheoremQA (Chen et al. 2023), while neglecting more open-ended task types such as summarization, open writing, and even general-domain open QA. In practice, these open-ended tasks usually lack reference answers or supply only low-quality ones.
Reliance on external reward mechanisms: Earlier methods such as Reinforcement Learning from Human Feedback (RLHF) (Stiennon et al. 2020) and Reinforcement Learning from AI Feedback (RLAIF) (Lee et al. 2023) have proven effective at improving model capabilities across general domains by leveraging feedback from human annotators or AI-based evaluators. However, these RL methods face significant limitations: they require extensive data annotation or dedicated reward models, leading to scalability challenges and additional computational overhead (Gao, Schulman, and Hilton 2023). Yuan et al. (2024) and Wu et al. (2024) introduce offline self-improving methods in which the model scores each of its own outputs individually and then constructs preference data for DPO training at each iteration, as sketched below. These approaches require supervised cold-start fine-tuning, and their point-wise scoring depends on carefully crafted evaluation standards, limiting their generalizability across diverse tasks.
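For contrast with the pairwise, online scheme we adopt, the following is a minimal sketch of this offline point-wise loop. The details differ across the cited works, so `model.generate`, `model.self_score`, and `dpo_train` are hypothetical stand-ins for response sampling, rubric-based self-scoring, and a DPO update; this is an illustration, not the exact procedure of any one paper.

```python
def self_rewarding_iteration(model, prompts, num_samples=4):
    """One offline iteration: point-wise self-scoring, then a DPO pass."""
    preference_pairs = []
    for prompt in prompts:
        # Sample several candidate responses from the current model.
        candidates = [model.generate(prompt) for _ in range(num_samples)]
        # Point-wise scoring: the model rates each response in isolation
        # against a hand-crafted rubric, independently of the other samples.
        scores = [model.self_score(prompt, c) for c in candidates]
        if max(scores) > min(scores):
            chosen = candidates[scores.index(max(scores))]
            rejected = candidates[scores.index(min(scores))]
            preference_pairs.append((prompt, chosen, rejected))
    # Offline update: train on the constructed preference data, then repeat
    # the whole loop with the updated model in the next iteration.
    return dpo_train(model, preference_pairs)
```

Because each response is scored in isolation, the quality of this loop hinges on the written rubric, which motivates the relative, comparison-based rewards used in SERL.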
To address these limitations, we propose SERL, a novel self-examining reinforcement learning framework designed to enhance LLM generation capabilities in open-domain scenarios without relying on any external supervision signals or reward mechanisms. Our framework introduces a self-examining mechanism in which the model alternately assumes the roles of Actor and Judge, jointly optimizing its generation and evaluation capabilities. During training, the model samples diverse responses for each input and then evaluates them itself. Specifically, we introduce Copeland-style judgments, as shown in Figure 2: the model conducts pairwise comparisons between responses and ranks them by their win rates within the group, which establishes ranking rewards for generation and intrinsic consistency rewards for evaluation. SERL leverages the model’s intrinsic evaluation capabilities while ensuring the co-evolution of generation and evaluation abilities via online learning, thus removing reliance on external reward signals.
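To make the reward construction concrete, here is a minimal sketch of one Copeland-style scoring round under stated assumptions: the exact normalizations, tie handling, and consistency measure are not pinned down in this section, so the win-rate reward and the order-swap agreement used below for the Judge’s self-consistency reward are illustrative choices, and `judge` is a hypothetical callable wrapping the model’s pairwise judgment prompt.

```python
from itertools import combinations
from typing import Callable, List, Tuple

# Hypothetical judge interface: given (prompt, response_a, response_b),
# return 0 if response_a is preferred and 1 if response_b is preferred.
Judge = Callable[[str, str, str], int]

def copeland_group_rewards(
    prompt: str,
    responses: List[str],
    judge: Judge,
) -> Tuple[List[float], float]:
    """Score a group of sampled responses via Copeland-style pairwise judging.

    Returns:
      actor_rewards: one ranking reward per response, its normalized win
        rate over all pairwise comparisons within the group.
      judge_reward: the fraction of pairs judged consistently when shown
        in both presentation orders (a self-consistency signal).
    """
    n = len(responses)
    if n < 2:
        return [0.0] * n, 0.0

    wins = [0.0] * n
    consistent_pairs = 0
    total_pairs = 0
    for i, j in combinations(range(n), 2):
        # Judge each pair in both presentation orders; agreement across
        # orders is what we count as a consistent judgment.
        forward = judge(prompt, responses[i], responses[j])   # 0 -> i wins
        backward = judge(prompt, responses[j], responses[i])  # 0 -> j wins
        total_pairs += 1
        if forward != backward:
            # Both orders name the same winner: a consistent judgment.
            consistent_pairs += 1
            wins[i if forward == 0 else j] += 1.0
        else:
            # Contradictory verdicts: split the point, as in a Copeland tie.
            wins[i] += 0.5
            wins[j] += 0.5

    # Each response takes part in n - 1 comparisons; normalize wins to [0, 1].
    actor_rewards = [w / (n - 1) for w in wins]
    judge_reward = consistent_pairs / total_pairs
    return actor_rewards, judge_reward
```

Judging every pair in both orders serves two purposes at once: it mitigates the position bias of LLM judges, and the agreement rate across orders directly supplies the Judge’s consistency reward, so no extra evaluation passes are needed.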