Reinforcement Learning for LLMs

Can models reliably improve themselves without external feedback?

Explores whether self-improvement alone can sustain progress, or whether structural limits—the generation-verification gap, diversity collapse—mean reliable progress requires external anchoring.

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback
How should we allocate compute budget at inference time? What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

Post-ready angle: Medium/LinkedIn

Self-improvement is the most compelling narrative in AI: models that learn from themselves, improving without human supervision, bootstrapping toward superhuman capability. The reality is more constrained — and the constraints are structural, not temporary.

The generation-verification gap bounds self-improvement from above. If a model can't verify solutions better than it can generate them, self-improvement has no room to operate. The gap scales with pretraining compute (bigger models have more room) but vanishes entirely for factual tasks (verification requires the same knowledge as generation). This means self-improvement isn't universally available — it works on some tasks and provably fails on others.
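The gap can be made concrete as best-of-n accuracy under the model's own verifier minus plain generation accuracy. A minimal sketch, with toy stand-ins for `generate`, `verify`, and `is_correct` (here a solution is just a boolean and the verifier is an oracle, i.e. the best case — real verifiers sit somewhere below this):

```python
import random
from statistics import mean

def generation_verification_gap(tasks, generate, verify, is_correct, n=16):
    """Best-of-n accuracy under the model's own verifier, minus plain
    generation accuracy. A gap near zero means self-improvement has no
    signal to exploit; a large gap means verification can lift generation."""
    gen_acc, verified_acc = [], []
    for task in tasks:
        samples = [generate(task) for _ in range(n)]
        gen_acc.append(mean(is_correct(task, s) for s in samples))
        best = max(samples, key=lambda s: verify(task, s))  # verifier reranks
        verified_acc.append(is_correct(task, best))
    return mean(verified_acc) - mean(gen_acc)

# Toy stand-ins: a "solution" is a bool, generated correctly 30% of the
# time; the verifier is an oracle, so this measures the gap's upper bound.
random.seed(0)
gap = generation_verification_gap(
    tasks=range(50),
    generate=lambda t: random.random() < 0.3,
    verify=lambda t, s: float(s),      # oracle verifier
    is_correct=lambda t, s: float(s),
    n=16,
)
```

For a factual task, `verify` degenerates to roughly the same function as `generate` — knowing whether a fact is right requires knowing the fact — and the gap collapses to zero, which is the "provably fails" case above.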

Diversity collapse limits self-improvement from within. During iterative self-improvement, pass@k increases for small k (top solutions improve) but decreases for large k (diversity shrinks). The model converges on solutions it can verify — typically common, expected patterns. Rare but correct solutions get filtered out. This is entropy collapse operating through the verification bottleneck.
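The crossover shows up directly in the standard unbiased pass@k estimator (Chen et al., 2021). The per-task correct-counts below are hypothetical, chosen to illustrate the dynamic: the common-pattern task improves while the rare-solution task is filtered out entirely.

```python
from math import comb
from statistics import mean

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n is correct, given c of
    the n samples are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical correct-counts out of n=100 samples on two tasks. After
# iterative self-improvement the common-pattern task improves (30 -> 60)
# while the rare-solution task collapses (10 -> 0).
n = 100
before, after = [30, 10], [60, 0]

for k in (1, 50):
    b = mean(pass_at_k(n, c, k) for c in before)
    a = mean(pass_at_k(n, c, k) for c in after)
    print(f"pass@{k}: before={b:.3f}  after={a:.3f}")
# pass@1 rises after training, but pass@50 falls: the crossover.
```

Averaged over tasks, pass@1 goes up while pass@50 goes down — the model looks better if you sample once and worse if you search widely.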

Reward hacking corrupts self-improvement from below. Self-consistency as proxy reward correlates with correctness initially, enabling RL without ground truth. But the model learns to maximize consistency rather than correctness — becoming confidently wrong. The proxy reward that enabled self-improvement becomes the mechanism that degrades it.
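A minimal sketch of the proxy (majority-agreement reward, not any paper's exact formulation): each rollout is rewarded by the fraction of its peers that agree with its final answer. Nothing in the reward checks correctness, so the degenerate policy — collapse onto one confident answer, right or wrong — is unbeatable.

```python
from collections import Counter

def self_consistency_rewards(answers):
    """Proxy reward with no ground truth: each rollout is scored by how
    many of the sampled answers (itself included) agree with it. Note
    that correctness never appears anywhere in this function."""
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]

# Early in training, agreement happens to track correctness...
print(self_consistency_rewards(["42", "42", "17", "42"]))  # [0.75, 0.75, 0.25, 0.75]
# ...but maximum reward goes to unanimous agreement on anything at all:
print(self_consistency_rewards(["17"] * 4))                # [1.0, 1.0, 1.0, 1.0]
```

RL against this signal pushes the policy toward the second case — confidently wrong — which is exactly the degradation described above.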

The circular argument: the model that needs to improve is the same model evaluating whether it improved. When the judge doesn't improve alongside the actor, training saturates. When the model self-corrects using SFT on its own correction traces, it learns corrections for someone else's mistakes. When reflection is supposed to catch errors, most reflection is confirmatory theater.

Every reliable fix requires something external: an external or stronger judge, ground-truth rewards, execution feedback, or a human in the loop.

The pattern: self-improvement works as a bootstrapping mechanism (getting initial gains cheaply) but stalls as a sustained strategy (each iteration degrades the signal that enables the next iteration). The reliable self-improvement methods are the ones that smuggle in something external while appearing self-contained.

OpenClaw-RL as external-signal recovery. OpenClaw-RL provides a concrete counterpoint: user replies, corrections, tool outputs, and execution results are external signals recovered as live, online training data. "The model can be optimized automatically through normal usage." Two complementary methods: evaluative signals (scalar rewards from PRM judge — a user re-query signals dissatisfaction, a passing test signals success) and directive signals (textual hints from next state via Hindsight-Guided OPD — "you should have checked the file first" provides token-level correction direction). This IS self-improvement that smuggles in external signal — through the user's reactions and tool feedback — while appearing self-directed. The Recursive Narcissist argument is partially addressed: this system receives input from outside the mirror. But the user's participation is required for the loop to work — remove the user and the external signal vanishes, leaving only the self-referential loop the mirage predicts.
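The two signal types can be sketched as follows. Everything here — the `Interaction` fields, the rule-based scoring — is a hypothetical illustration of the evaluative/directive split described above, not OpenClaw-RL's actual API or its PRM judge.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Interaction:
    """One turn of normal usage: the model's action plus what came back.
    All field names are hypothetical, for illustration only."""
    action: str
    test_passed: Optional[bool] = None     # execution result, if any
    user_requeried: bool = False           # user asked again: dissatisfaction
    user_correction: Optional[str] = None  # e.g. a textual correction

def evaluative_signal(turn: Interaction) -> float:
    """Scalar reward recovered from usage (the PRM-judge role, sketched
    as simple rules): passing tests are positive, failures and user
    re-queries are negative."""
    if turn.test_passed is True:
        return 1.0
    if turn.test_passed is False or turn.user_requeried:
        return -1.0
    return 0.0

def directive_signal(turn: Interaction) -> Optional[str]:
    """Textual hint from the next state (the hindsight-guidance role):
    the user's correction carries token-level direction, not just a scalar."""
    return turn.user_correction

turn = Interaction(action="ran refactor", test_passed=False,
                   user_correction="you should have checked the file first")
reward = evaluative_signal(turn)   # -1.0
hint = directive_signal(turn)
```

Note what happens when the user and tools are removed: every field defaults to nothing, the reward is 0.0, the hint is None, and the loop is back to judging itself — which is the mirage's prediction.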

Hook: "Self-improvement sounds like the path to AGI. But the model that needs to improve is the same model deciding whether it improved. Here's why that's a problem — and what actually works."

Sources: generation-verification gap (Mind the Gap), self-consistency reward hacking (Can Large Reasoning Models Self-Train?), meta-rewarding (Meta-Rewarding), SCoRe distribution mismatch, degeneration of thought (ReConcile), confirmatory reflection (First Try Matters), diversity collapse, self-rewarding gradient collapse (Temporal Self-Rewarding).



Original note title

the self-improvement mirage — why pure self-improvement is circular and every reliable fix requires something external