Why does self-judgment of success or failure work without ground truth labels?
This explores how models can score their own success or failure during training and improvement without an external answer key — and what actually supplies the missing signal when no ground-truth label is present.
The corpus's honest answer is that "label-free" rarely means "signal-free": something is quietly standing in for the answer key. The clearest demonstration that self-judgment can work at all is self-examining RL (SERL; see "Can models learn to judge themselves without external rewards?"), where a model alternates between generating responses and judging them in pairwise comparisons, then derives reward from how *consistent* its own rankings are. The trick isn't that the model magically knows the right answer. It's that consistency across many self-comparisons is a structure that correlates with quality, even when no single judgment is anchored to truth. A related move is ΔBelief-RL ("Can an agent's own beliefs guide credit assignment without critics?"), which reads the model's own shifting probability estimates: when a step moves the model closer to believing in a solution, that belief shift becomes a dense per-step reward. This is credit assignment with no critic and no labels, because the agent's own trajectory of conviction is the signal.
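The ranking-consistency idea can be made concrete with a small sketch. This is not SERL's actual reward formula, which the notes here do not spell out. It assumes one simple notion of consistency: a pairwise verdict counts only if the model gives the same verdict when the two responses are shown in the opposite order. The names `consistency_rewards` and `judge` are hypothetical.

```python
from itertools import combinations

def consistency_rewards(n_responses, judge):
    """Illustrative sketch only (assumed scheme, not the published method).

    judge(i, j) returns True if the model prefers response i when the pair
    is presented in the order (i, j).
    Returns (per-response rewards, judge consistency score).
    """
    wins = [0.0] * n_responses
    consistent_pairs = 0
    total_pairs = 0
    for i, j in combinations(range(n_responses), 2):
        prefers_i_forward = judge(i, j)       # i shown first
        prefers_i_swapped = not judge(j, i)   # j shown first, verdict inverted
        total_pairs += 1
        if prefers_i_forward == prefers_i_swapped:
            # The verdict survived the order swap, so it counts as a win.
            consistent_pairs += 1
            winner = i if prefers_i_forward else j
            wins[winner] += 1.0
        # Verdicts that flip with presentation order contribute no reward.
    denom = max(n_responses - 1, 1)
    response_rewards = [w / denom for w in wins]
    judge_reward = consistent_pairs / total_pairs if total_pairs else 0.0
    return response_rewards, judge_reward
```

A judge that always favors whichever response it sees first earns zero on both scores under this scheme, which shows how a consistency requirement can filter out one kind of unanchored judgment without any external label.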
So why does this work? Because these methods exploit internal structure that happens to track correctness: ranking consistency, belief convergence, or self-evaluation learned during training. Post-completion learning ("Can models learn to evaluate their own work during training?") makes this concrete: the model is trained to compute its own reward in the unused sequence space after its answer, internalizing evaluation rather than calling an external judge. And there's genuine evidence the substrate exists: models build entity-recognition mechanisms that causally track whether they actually know a fact ("Do models know what they don't know?"), a real internal self-knowledge signal that steers refusal versus hallucination.
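The belief-convergence signal can be sketched in a few lines. The source notes say only that ΔBelief-RL uses log-ratios of sequential probability estimates for per-turn credit, so the exact form below is an assumption: the reward for a step is the log of the ratio between the model's belief in the solution after the step and before it. The function name and the `eps` smoothing term are hypothetical.

```python
import math

def delta_belief_rewards(beliefs, eps=1e-9):
    """Illustrative sketch only (assumed form of the log-ratio reward).

    beliefs: the model's own probability estimates that it is on track to a
    solution, one per turn, each in [0, 1].
    Returns one reward per transition between consecutive turns.
    """
    rewards = []
    for prev, curr in zip(beliefs, beliefs[1:]):
        # Positive when the step raised the model's belief, negative when it
        # lowered it. eps guards against log(0) and division by zero.
        rewards.append(math.log((curr + eps) / (prev + eps)))
    return rewards
```

Because the log-ratios telescope, the per-step rewards sum to the log-ratio between the final and initial belief, so the dense signal stays tied to the overall change in conviction across the episode.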
But the corpus pushes back hard, and this is the part worth knowing. The self-improvement mirage paper ("Can models reliably improve themselves without external feedback?") argues that *pure* self-judgment is structurally circular: it stalls on the generation-verification gap, diversity collapse, and reward hacking. Its sharpest claim is that every method that *reliably* improves is secretly smuggling in an external anchor, such as a past model version, a third-party judge, user corrections, or tool feedback. By that reading, self-examining RL's "consistency" and ΔBelief's "convergence" work precisely because they're proxies that lean on structure the base model already absorbed from real-world data, not on a closed loop with no outside reference.
The failure modes sharpen why naive self-judgment shouldn't work. Models carry a structural bias toward trusting answers they generated themselves ("Why do models trust their own generated answers?"): high-probability outputs simply *feel* correct on re-evaluation, which is exactly the self-agreement loop a good method has to break. SERL breaks it by comparing alternatives rather than rating in isolation. Reflection in reasoning models is mostly confirmatory theater that rarely changes the initial answer, and calibration actually *degrades* under binary reward training ("Can we actually trust reasoning model outputs?"). Most self-reports echo training-data distributions rather than real introspection ("Can language models actually introspect about their own states?"), with the telling exception that genuine self-assessment appears only when a *causal chain* links an internal state to the report, as when a model infers it's running at low temperature from its own output consistency. That exception is the whole answer in miniature: self-judgment works when, and only when, the signal it reads is causally tied to the thing being judged.
The doorway worth walking through: external grounding is the proven escape from circularity. ReAct ("Can interleaving reasoning with real-world feedback prevent hallucination?") shows that interleaving reasoning with real-world tool feedback beats pure chain-of-thought reasoning and reinforcement learning by 10–34% absolute accuracy on knowledge-intensive and interactive tasks. It is a reminder that the strongest "self-correction" results tend to be the ones that quietly let a bit of the world back in.
Sources (9 notes)
SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.