Reinforcement Learning for LLMs

Why do self-improvement loops eventually stop improving?

Self-improvement systems often plateau because the evaluator that judges progress stays static while the actor grows. What happens when judges don't improve alongside learners?

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

Meta-Rewarding (Llama-3-8B-Instruct) demonstrates that self-improvement loops stall not because the actor can't improve, but because the judge that evaluates improvement doesn't keep up. Prior self-rewarding work unified generator and evaluator in a single model, improving the actor through iterative DPO on self-generated preference pairs. But the judge capability remained static — the same evaluation quality was applied to increasingly sophisticated outputs. The result: saturation, or worse, reward hacking against a fixed evaluation surface.
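For concreteness, a minimal sketch of that prior self-rewarding iteration. The callables `generate`, `judge_score`, and `dpo_train` are hypothetical stand-ins for sampling, LLM-as-a-Judge scoring, and DPO training; they are not APIs from the paper.

```python
# Sketch of one prior self-rewarding iteration (before Meta-Rewarding).
# generate, judge_score, and dpo_train are hypothetical callables supplied
# by the caller, not functions from the paper's codebase.
def self_rewarding_iteration(model, prompts, generate, judge_score, dpo_train,
                             n_samples=4):
    preference_pairs = []
    for prompt in prompts:
        # The same model acts as generator, sampling several candidate responses...
        responses = [generate(model, prompt) for _ in range(n_samples)]
        # ...and as judge, scoring its own responses.
        scores = [judge_score(model, prompt, r) for r in responses]
        chosen = responses[scores.index(max(scores))]
        rejected = responses[scores.index(min(scores))]
        preference_pairs.append((prompt, chosen, rejected))
    # Only actor preferences are trained on; nothing in the loop improves the
    # judge, so evaluation quality stays fixed while the actor gets better.
    return dpo_train(model, preference_pairs)
```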

The fix is a third role: the meta-judge. The model evaluates its own judgments using LLM-as-a-Meta-Judge prompting — selecting the better of two judgments on the same response. This creates preference data for the judge, not just for the actor. Training on both actor and judge preferences via DPO co-evolves both capabilities.
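A sketch of the added meta-judging step under the same assumptions. Here `judge_with_rationale` and `meta_judge_prefers` are hypothetical callables standing in for LLM-as-a-Judge and LLM-as-a-Meta-Judge prompting.

```python
# Sketch of building a judge preference pair via meta-judging.
# judge_with_rationale and meta_judge_prefers are hypothetical callables:
# the first produces a judgment (rationale + score) for a response, the
# second applies an LLM-as-a-Meta-Judge prompt to pick the better judgment.
def build_judge_preference(model, prompt, response,
                           judge_with_rationale, meta_judge_prefers):
    # Judge the same response twice, then meta-judge the two judgments.
    judgment_a = judge_with_rationale(model, prompt, response)
    judgment_b = judge_with_rationale(model, prompt, response)
    if meta_judge_prefers(model, prompt, response, judgment_a, judgment_b):
        chosen, rejected = judgment_a, judgment_b
    else:
        chosen, rejected = judgment_b, judgment_a
    # This pair trains the judge role; actor pairs are built as before, and
    # both kinds of pairs feed the same DPO update so the two roles co-evolve.
    return (prompt, chosen, rejected)
```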

The results are surprisingly strong for an unsupervised method: AlpacaEval 2 win rate from 22.9% to 39.4%, Arena-Hard from 20.6% to 29.1%. The meta-judging step focuses on responses where the judge is least certain (highest score variance), targeting calibration at the decision boundary.
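One way to implement that targeting, assuming repeated judging passes per response and score variance as the uncertainty proxy (a sketch, not necessarily the paper's exact procedure):

```python
import statistics

# Sketch: pick the responses whose judge scores vary most across repeated
# judging passes; these are where better judge calibration matters most.
# scores_per_response maps each response to its list of scores (>= 2 passes).
def select_for_meta_judging(scores_per_response, k=1):
    ranked = sorted(scores_per_response.items(),
                    key=lambda item: statistics.variance(item[1]),
                    reverse=True)
    return [response for response, _ in ranked[:k]]
```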

A practical complication: length explosion. With each iteration, responses grow longer because the judge has a length bias — a well-known reward model problem. Meta-Rewarding requires explicit length control to prevent this.
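One simple form of length control, sketched under the assumption that the chosen response is picked from the near-best candidates; the paper's exact rule may differ.

```python
# Sketch of a length-controlled chosen-response selection rule: take the
# responses scored within `margin` of the best score and prefer the shortest,
# so the judge's length bias cannot keep inflating outputs across iterations.
def pick_chosen_with_length_control(responses, scores, margin=0.5):
    best = max(scores)
    near_best = [r for r, s in zip(responses, scores) if s >= best - margin]
    return min(near_best, key=len)
```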

This is a different solution to the same problem addressed in "Why does self-rewarding training collapse when responses improve?". Temporal anchoring fixes the preference signal (maintaining the gap between chosen and rejected); meta-judging fixes evaluator quality (making the judge more accurate). The two fixes are complementary, and a system could use both.

The broader principle: any self-improvement loop where the evaluator doesn't improve alongside the learner will eventually stall. This applies to RLHF (frozen reward models), self-rewarding (same-model judging), and even human-in-the-loop systems where human evaluators don't recalibrate as models improve.


Source: Self Refinement Self Consistency Feedback — Meta-Rewarding Language Models (arXiv:2407.19594)


Self-improvement requires co-evolving the evaluator alongside the actor; a static judge becomes the ceiling that constrains actor training.