Reasoning and Learning Architectures

Can language models replace reward models with internal signals?

Recent RL research shows three independent patterns—self-judgment, belief-shift, and rich feedback—that each eliminate a component of the traditional RLHF stack. Are these patterns converging on a fundamentally different architecture for training without external verifiers?

Note · 2026-05-18 · sourced from Reinforcement Learning

The RLHF-RLVR stack rests on three load-bearing components: a reward signal (preference labels for RLHF, verifiers for RLVR), a reward model trained on that signal, and a policy optimizer (PPO, GRPO) that consumes the RM's output. Each component has scaling problems. Preference labels are expensive and culturally biased. Verifiers exist only for verifiable domains. Reward models suffer from prompt-context blindness, reward hacking, and generalization failure.

Late-2025 RL papers are independently converging on three substitutable patterns that each replace one component without touching the others. Together they suggest the reward-model-as-separate-module is no longer load-bearing — it can be replaced by mechanisms internal to the policy itself.

Pattern one: pairwise self-judgment. Can models learn to judge themselves without external rewards? The model plays Actor and Judge alternately. Copeland-style ranking of self-generated responses produces the training signal for the Actor; self-consistency on those rankings produces the signal for the Judge. The two channels co-evolve with no external supervision. Replaces: the reward model.
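A minimal sketch of the Copeland-style ranking step, assuming a pairwise judge is available as a plain function. The names copeland_scores and toy_judge are hypothetical, and the toy judge (which prefers the longer string) only stands in for the model acting as Judge. Ties and position bias are ignored here.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def copeland_scores(
    responses: Sequence[str],
    prefers: Callable[[str, str], bool],
) -> List[int]:
    """Copeland score of each response: pairwise wins minus pairwise losses.

    `prefers(a, b)` returns True when the judge prefers a over b.
    In the self-judgment setting the judge would be the model itself;
    here it is an arbitrary callable (an assumption of this sketch).
    """
    scores = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        if prefers(responses[i], responses[j]):
            scores[i] += 1
            scores[j] -= 1
        else:
            scores[j] += 1
            scores[i] -= 1
    return scores


def toy_judge(a: str, b: str) -> bool:
    """Hypothetical stand-in judge: prefers the longer response."""
    return len(a) > len(b)


candidates = ["a", "abc", "ab"]
scores = copeland_scores(candidates, toy_judge)
print(scores)  # [-2, 2, 0]
```

The scores give a ranking over the sampled responses, which is the quantity the note describes as the Actor's training signal. How the ranking is turned into a policy update is left out of this sketch.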

Pattern two: internal belief-shift. Can an agent's own beliefs guide credit assignment without critics? The dense intrinsic reward is the change in the probability the agent itself assigns to the target solution, measured as the log-ratio of sequential beliefs and computed from a single forward pass. Replaces: the critic / PRM.
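The log-ratio of sequential beliefs can be sketched as follows, assuming the agent's probability for the target solution has already been read off after each step. The function name belief_shift_rewards and the belief values are illustrative; how the probabilities are extracted from the model is outside this sketch, and the exact formulation in the underlying work may differ.

```python
import math
from typing import List, Sequence


def belief_shift_rewards(beliefs: Sequence[float]) -> List[float]:
    """Per-step reward r_t = log p_t(target) - log p_{t-1}(target).

    beliefs[t] is the probability the agent assigns to the target
    solution after step t (all values assumed to be in (0, 1]).
    """
    return [
        math.log(curr) - math.log(prev)
        for prev, curr in zip(beliefs, beliefs[1:])
    ]


# Made-up belief trajectory: up, down, then up again.
beliefs = [0.1, 0.2, 0.1, 0.4]
rewards = belief_shift_rewards(beliefs)

# The rewards telescope: their sum equals log(final / initial).
total = sum(rewards)
```

A step that raises the agent's belief in the target gets a positive reward and a step that lowers it gets a negative one, which is what makes the signal dense. Because the per-step terms telescope, the summed reward depends only on the first and last belief.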

Pattern three: rich-feedback self-distillation. Can environment feedback replace scalar rewards in policy learning? Environment feedback (runtime errors, judge text, compile traces) becomes the supervision. The current policy, conditioned on that feedback, serves as the self-teacher, and its feedback-informed next-token distribution is distilled back into the policy. Replaces: the explicit reward signal.
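A toy sketch of the distillation objective, assuming the loss is a mean per-token KL divergence from the feedback-conditioned teacher distribution to the unconditioned student distribution. The direction of the KL, the function names, and the logit values are all assumptions of this sketch; both sets of logits would come from the same policy, with and without the feedback in context.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
) -> float:
    """Mean per-token KL(teacher || student).

    The teacher is treated as a fixed target (no gradient would flow
    through it in a real training loop).
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Two token positions, vocabulary of size 3 (made-up numbers).
student_logits = [[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]  # policy without feedback
teacher_logits = [[3.0, 1.0, 0.0], [2.0, 0.0, 0.0]]  # same policy with feedback
loss = distillation_loss(teacher_logits, student_logits)
```

The loss is zero wherever the feedback does not change the next-token distribution and positive where it does, so the update concentrates on the tokens the feedback actually informs.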

Each pattern can in principle compose with the others. SERL + ΔBelief (patterns one and two) combines self-judgment with a dense intrinsic signal. SDPO + SERL (patterns three and one) combines rich feedback with self-evaluation. The substrate is the same in every case, namely the language model with appropriate in-context conditioning, and each component performs a role that the others cannot.

The structural claim: RL is being decomposed into substitutable parts. Pretraining + verifier was one architecture. Pretraining + intrinsic signal is another. Pretraining + self-judgment is a third. None of these requires the reward-model-as-trained-classifier component that defined classical RLHF. The reward model was load-bearing for absolute-preference RLHF; for the verifier-free patterns, it is replaced by mechanisms that emerge from the policy's own computations.

The writing angle worth tracking: if the reward model goes away, what changes about alignment? RLHF was inseparable from a specific architectural commitment — train a reward model to encode human preferences, then optimize against it. Verifier-free RL leaves the preference-encoding question open. Where does alignment come from when the RM is not the locus? Some answers: rich feedback (the environment carries it), self-judgment (the model encodes it), community feedback (citations encode it). The substitutability of mechanisms is also a fragmentation of where alignment lives.

A second angle worth tracking: this is what learning without supervision looks like when the model is already capable enough to retrospect, judge, and assess its own beliefs. Each pattern leverages an in-context capability of the model. The patterns work because the model is good enough at the relevant in-context task to bootstrap supervision from itself.

Original note title

verifier-free RL is converging on three substitutable patterns — pairwise self-judgment, internal belief-shift, and rich-feedback self-distillation