Can models learn to judge themselves without external rewards?
Can a language model train itself by alternating between generating responses and evaluating them using only internal consistency signals? This note explores whether evaluation itself can become a learnable skill without external supervision.
Open-domain tasks (summarization, open writing, general QA) are where RL hits its hardest wall. RLVR needs verifiable answers; these tasks have none. RLHF needs external annotators or reward models; the cost is prohibitive and quality is fragile. Existing self-improvement methods (point-wise self-scoring + DPO) require supervised cold-start and depend on well-crafted standards, limiting cross-task generality.
SERL (arXiv:2511.07922) proposes a structural escape: the model plays Actor and Judge simultaneously, with two synergistic reward mechanisms generated entirely from within. The Actor's reward comes from Copeland-style pairwise comparisons: for each input, sample multiple responses, compare them pairwise, and rank them by win rate within the group. The win-rate ranking becomes the training signal for generation. The Judge's reward comes from self-consistency across its own judgments: if the Judge ranks A>B and B>C, it should also rank A>C. Inconsistent judgments are penalized through a separate reward channel.
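The Judge-side idea can be illustrated with a small sketch. This is not the paper's reward formula, which this note does not specify. It assumes one plausible proxy: the fraction of response triples whose pairwise verdicts are transitive. The function name `transitivity_reward` and the `prefers` dictionary layout are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def transitivity_reward(prefers: Dict[Tuple[int, int], bool], n: int) -> float:
    """Fraction of response triples whose pairwise verdicts are transitive.

    prefers[(i, j)] with i < j is True when the Judge ranked response i
    above response j. Assumption: one verdict per unordered pair, no ties.
    """
    def beats(i: int, j: int) -> bool:
        # Look up the verdict regardless of the order the pair is asked in.
        return prefers[(i, j)] if i < j else not prefers[(j, i)]

    triples = list(combinations(range(n), 3))
    if not triples:
        return 1.0  # fewer than three responses: nothing can be inconsistent

    consistent = 0
    for a, b, c in triples:
        wins = sorted([
            beats(a, b) + beats(a, c),
            beats(b, a) + beats(b, c),
            beats(c, a) + beats(c, b),
        ])
        # Three pairwise verdicts form a cycle exactly when every member
        # wins once; otherwise the win counts are 0, 1, 2 and the triple
        # is transitive.
        if wins != [1, 1, 1]:
            consistent += 1
    return consistent / len(triples)


# A>B, B>C, A>C is consistent; A>B, B>C, C>A is a cycle.
print(transitivity_reward({(0, 1): True, (1, 2): True, (0, 2): True}, 3))
print(transitivity_reward({(0, 1): True, (1, 2): True, (0, 2): False}, 3))
```

A score like this needs no reference labels. It only checks the Judge's verdicts against each other, which is what makes it usable as an internal signal.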
The two channels are synergistic, not redundant. Strengthening the Judge produces a more robust training signal for the Actor. Strengthening the Actor produces more diverse, distinguishable responses for the Judge to evaluate. Both abilities co-evolve through online learning.
The Copeland mechanism is specifically chosen because it converts subjective response quality into a relative ordering with provable consistency properties. Pairwise comparison reduces the abstract "which is better" question to a tractable judgment local to two candidates. Aggregating across all pairs produces a ranking. Win-rate-against-group becomes a scalar reward for each candidate without ever requiring an absolute quality score.
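The win-rate-against-group reward described above can be sketched in a few lines. This is a minimal illustration, not SERL's implementation. The `judge` callable is a hypothetical stand-in for the model acting as Judge, and the sketch assumes each pair is compared once with no ties.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def copeland_win_rates(
    responses: Sequence[str],
    judge: Callable[[str, str], bool],
) -> List[float]:
    """Return each response's win rate against the rest of its group.

    judge(a, b) returns True when a is preferred over b. Assumption: each
    unordered pair is compared once and every comparison has a winner.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")

    wins = [0] * n
    for i, j in combinations(range(n), 2):
        if judge(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1

    # Each response takes part in n - 1 comparisons, so this is a scalar
    # in [0, 1] obtained without any absolute quality score.
    return [w / (n - 1) for w in wins]


def toy_judge(a: str, b: str) -> bool:
    # Toy preference for demonstration only: the longer response wins.
    return len(a) > len(b)


print(copeland_win_rates(["a", "bbb", "cc"], toy_judge))  # [0.0, 1.0, 0.5]
```

A group of n responses costs n(n-1)/2 judge calls, so the group size trades signal quality against compute.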
Empirically, SERL improves Qwen3-8B's AlpacaEval 2.0 LC (length-controlled) win rate from 52.37% to 59.90%, a gain of 7.53 percentage points, without any external reward signals.
The deeper move is the unification: the model's evaluation capability is itself trainable through self-consistency, while its generation capability is trainable through the evaluation's outputs. Generation and evaluation become two views of the same competence. This parallels the note "Can reasoning during evaluation reduce judgment bias in LLM judges?": J1 converts judging to a verifiable problem; SERL converts judging to a self-consistency problem. Different routes to making evaluation a first-class trainable target, no external supervision required.
For the broader landscape: SERL is the third independent verifier-free RL pattern alongside ΔBelief-RL (belief shift) and SDPO (self-distillation from rich feedback). Each replaces a different component of the RLHF/RLVR stack. The reward model as a separately-trained module is no longer load-bearing.
Paper: SERL: Self-Examining Reinforcement Learning on Open-Domain
Related concepts in this collection
- Can reasoning during evaluation reduce judgment bias in LLM judges?
  Can training language model judges to think through their evaluations, rather than pattern-matching on surface features, mitigate the four known biases that make them vulnerable to manipulation attacks?
  J1 makes judging a verifiable RL problem; SERL makes judging a self-consistency problem; both make evaluation a first-class trainable competence
- Can environment feedback replace scalar rewards in policy learning?
  Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.
  SDPO escapes external supervision via feedback-conditioned self-teacher; SERL via self-judgment with consistency check: same goal, different mechanism
- Can an agent's own beliefs guide credit assignment without critics?
  Explore whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.
  ΔBelief-RL's intrinsic signal is target-grounded; SERL's is pairwise-relative; three different intrinsic-reward families converging on verifier-free RL
- Can language models replace reward models with internal signals?
  Recent RL research shows three independent patterns (self-judgment, belief-shift, and rich feedback) that each eliminate a component of the traditional RLHF stack. Are these patterns converging on a fundamentally different architecture for training without external verifiers?
  meta-claim: SERL is one of three convergent paths
Original note title
self-examining RL eliminates external reward dependence by alternating actor and judge roles — Copeland-style pairwise judgments produce ranking and self-consistency rewards