Reasoning and Learning Architectures

Can models learn to judge themselves without external rewards?

Can a language model train itself by alternating between generating responses and evaluating them using only internal consistency signals? This explores whether evaluation itself can become a learnable skill without external supervision.

Note · 2026-05-18 · sourced from Reinforcement Learning

Open-domain tasks (summarization, open writing, general QA) are where RL hits its hardest wall. RLVR needs verifiable answers; these tasks have none. RLHF needs external annotators or reward models; the cost is prohibitive and quality is fragile. Existing self-improvement methods (point-wise self-scoring + DPO) require supervised cold-start and depend on well-crafted standards, limiting cross-task generality.

SERL (2511.07922) proposes a structural escape: the model simultaneously plays Actor and Judge, with two synergistic reward mechanisms generated entirely from within. The Actor's reward comes from Copeland-style pairwise comparison judgments: for each input, the model samples multiple responses, compares them pairwise, and ranks them by win rate within the group. That win-rate ranking becomes the training signal for generation. The Judge's reward comes from self-consistency across its own judgments: if the Judge ranks A>B and B>C, it should also rank A>C. Inconsistencies are penalized through a separate reward channel for the Judge.
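The transitivity condition on the Judge can be illustrated with a small check. This is a minimal sketch, not the paper's reward formula: the function name `transitivity_score`, the `beats` set representation, and the choice to score the fraction of consistent triples are all assumptions made for illustration.

```python
from itertools import permutations


def transitivity_score(n: int, beats: set[tuple[int, int]]) -> float:
    """Fraction of chains a>b, b>c for which the Judge also ranked a>c.

    `beats` holds (i, j) when the Judge ranked response i above response j.
    Hypothetical scoring rule: returns 1.0 when no chain exists to check.
    """
    checked = 0
    consistent = 0
    for a, b, c in permutations(range(n), 3):
        if (a, b) in beats and (b, c) in beats:
            checked += 1
            if (a, c) in beats:
                consistent += 1
    return consistent / checked if checked else 1.0


# A transitive set of judgments: 0 > 1, 1 > 2, 0 > 2.
transitive = {(0, 1), (1, 2), (0, 2)}
# A cyclic set of judgments: 0 > 1, 1 > 2, 2 > 0.
cyclic = {(0, 1), (1, 2), (2, 0)}

score_transitive = transitivity_score(3, transitive)
score_cyclic = transitivity_score(3, cyclic)
```

Under this sketch a fully transitive Judge scores 1.0 and a Judge whose preferences form a cycle scores 0.0, so the score could serve as a scalar consistency signal.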

The two channels are synergistic, not redundant. Strengthening the Judge produces a more robust training signal for the Actor. Strengthening the Actor produces more diverse, distinguishable responses for the Judge to evaluate. Both abilities co-evolve through online learning.

The Copeland mechanism is specifically chosen because it converts subjective response quality into a relative ordering with provable consistency properties. Pairwise comparison reduces the abstract "which is better" question to a tractable judgment local to two candidates. Aggregating across all pairs produces a ranking. Win-rate-against-group becomes a scalar reward for each candidate without ever requiring an absolute quality score.
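The win-rate-against-group idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not SERL's implementation: `copeland_win_rates`, the `prefers` callback standing in for the Judge, and the length-based `toy_judge` are hypothetical, and ties and position bias are ignored.

```python
from itertools import combinations
from typing import Callable, Sequence


def copeland_win_rates(
    responses: Sequence[str],
    prefers: Callable[[str, str], bool],
) -> list[float]:
    """Return each response's win rate against the rest of its group.

    `prefers(a, b)` is a hypothetical stand-in for the Judge: it returns
    True when the Judge considers `a` better than `b`.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    wins = [0] * n
    # One judgment per unordered pair; the winner of each pair gets a point.
    for i, j in combinations(range(n), 2):
        if prefers(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each response takes part in n - 1 comparisons.
    return [w / (n - 1) for w in wins]


def toy_judge(a: str, b: str) -> bool:
    """Toy judge for illustration only: prefers the longer response."""
    return len(a) > len(b)


group = ["ok", "a fuller answer", "a much more detailed answer", "meh"]
rewards = copeland_win_rates(group, toy_judge)
```

Each candidate receives a scalar in [0, 1] derived purely from relative judgments; no absolute quality score is ever computed.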

Empirically: SERL improves Qwen3-8B's AlpacaEval 2.0 LC (length-controlled) win rate from 52.37% to 59.90%, a gain of 7.53 percentage points, without any external reward signals.

The deeper move is the unification: the model's evaluation capability is itself trainable through self-consistency, while its generation capability is trainable through the evaluation's outputs. Generation and evaluation become two views of the same competence. This parallels the note "Can reasoning during evaluation reduce judgment bias in LLM judges?": J1 converts judging to a verifiable problem; SERL converts judging to a self-consistency problem. Both are routes to making evaluation a first-class trainable target, with no external supervision required.

For the broader landscape: SERL is the third independent verifier-free RL pattern alongside ΔBelief-RL (belief shift) and SDPO (self-distillation from rich feedback). Each replaces a different component of the RLHF/RLVR stack. The reward model as a separately-trained module is no longer load-bearing.


Paper: SERL: Self-Examining Reinforcement Learning on Open-Domain

Original note title

self-examining RL eliminates external reward dependence by alternating actor and judge roles — Copeland-style pairwise judgments produce ranking and self-consistency rewards