How do self-evolving curricula help RL break beyond base model capability boundaries?
This explores whether a curriculum that generates its own progressively harder tasks can push RL past the ceiling of what the base model already knew — and the corpus is openly split on whether that ceiling can be broken at all.
The corpus disagrees about whether base model capability boundaries can be broken at all. One camp argues RL mostly *redeploys* what's already latent: pass@k analysis shows base models matching or beating RLVR-trained models at high k, suggesting RL narrows sampling toward solutions already in the base distribution rather than adding new ones (see "Does RLVR actually expand what models can reason about?"). Related work frames verifiable rewards as catalysts that surface pretrained strategies rather than teachers of new ones (see "How does RL training reshape reasoning and what gets lost?"), with RL teaching *when* to reason, not *how* (see "Does RL post-training create reasoning or just deploy it?"). If that's the whole story, no curriculum can break a boundary, because RL never adds anything that lies outside the base distribution.
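To make the pass@k argument concrete, here is a minimal sketch using the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The notes above do not say which estimator the cited analyses used, so treat that choice as an assumption; the per-problem counts are invented purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical data: 4 problems, n = 100 samples per problem.
N_SAMPLES = 100
base_correct = [30, 5, 2, 1]   # base model: rarely right, but never at zero
rl_correct = [95, 60, 0, 0]    # RL model: reliable on two problems, zero on the rest

for k in (1, 10, 100):
    base = mean_pass_at_k(base_correct, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_correct, N_SAMPLES, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

With these made-up counts the RL model wins at k=1 and the base model wins at k=100, which is the crossover pattern the "RL only redeploys" camp points to: the RL model has concentrated probability on problems it already solved and lost the rare successes elsewhere.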
But the opposing result is exactly where curricula earn their keep. Prolonged RL on *diverse, non-mathematical* tasks, with KL control and policy resetting, produces models that beat the base across all pass@k levels, which the authors read as genuine boundary expansion, not just sampling efficiency (see "Can reinforcement learning discover reasoning strategies base models cannot?"). The operative words are 'prolonged' and 'diverse.' A static reward on a fixed task distribution collapses fast: RL converges on a single dominant pretraining format within the first epoch, suppressing alternatives (see "Does RL training collapse format diversity in pretrained models?"). A self-evolving curriculum is the mechanism that keeps the task distribution moving faster than the model can collapse onto it: it keeps feeding the model problems just past its current frontier, so there's always something the existing distribution can't already solve.
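One way to picture "problems just past the current frontier" is a task pool that trains only on tasks the policy solves sometimes but not reliably, and that swaps each mastered task for a harder one. The sketch below is a toy under stated assumptions: the `Task` class, the 0.2 to 0.8 success band, the `min_attempts` threshold, and the "difficulty + 1" rule are all hypothetical and are not taken from any of the cited notes.

```python
from dataclasses import dataclass

@dataclass
class Task:
    difficulty: int
    attempts: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        # Untried tasks count as 0.5 so they sit in the frontier until measured.
        return self.successes / self.attempts if self.attempts else 0.5

def frontier(tasks, low=0.2, high=0.8):
    """Tasks the policy solves sometimes but not reliably."""
    return [t for t in tasks if low <= t.success_rate <= high]

def evolve(tasks, high=0.8, min_attempts=8):
    """Replace each mastered task with a fresh task one step harder."""
    evolved = []
    for t in tasks:
        mastered = t.attempts >= min_attempts and t.success_rate > high
        evolved.append(Task(t.difficulty + 1) if mastered else t)
    return evolved

# Hypothetical rollout statistics after one training round.
pool = [
    Task(1, attempts=10, successes=10),  # mastered: will be replaced
    Task(2, attempts=10, successes=5),   # frontier: worth training on
    Task(3, attempts=10, successes=0),   # too hard for now: kept, not trained on
]
train_on = frontier(pool)
pool = evolve(pool)
print([t.difficulty for t in train_on], [t.difficulty for t in pool])
```

The design point is that the training set is defined relative to the policy's current success rates, so it moves whenever the policy does; a fixed task list has no such feedback.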
The deeper reason curricula matter is that *self-improvement alone is provably bounded*. Pure self-improvement stalls on the generation–verification gap, diversity collapse, and reward hacking; every method that actually works smuggles in an external anchor, such as a past model version, a third-party judge, a user correction, or tool feedback (see "Can models reliably improve themselves without external feedback?"). That limit holds formally, not just empirically (see "What stops large language models from improving themselves?"). A self-evolving curriculum is one way to manufacture that external signal continuously: the environment itself becomes the judge. This is why VOYAGER's automatic curriculum works: it pairs an externalized, composable skill library with environmental feedback so the agent keeps exploring and refining instead of forgetting, escaping the catastrophic forgetting of pure weight updates (see "Can agents learn new skills without forgetting old ones?").
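The skill-library idea can be sketched as a store that admits a skill only after the environment confirms it worked, and that retrieves skills by embedding similarity. This is a toy illustration, not VOYAGER's implementation: the `SkillLibrary` class, the bag-of-words `embed` stand-in for a real embedding model, and the `verified` flag are all assumptions made for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SkillLibrary:
    def __init__(self):
        self._skills = []  # list of (description, embedding, code)

    def add(self, description, code, verified):
        """Store a skill only if external (environment) feedback confirmed it."""
        if verified:
            self._skills.append((description, embed(description), code))
        return verified

    def retrieve(self, query, top_k=1):
        """Return the top_k stored skills most similar to the query."""
        q = embed(query)
        ranked = sorted(self._skills, key=lambda s: cosine(q, s[1]), reverse=True)
        return [(desc, code) for desc, _, code in ranked[:top_k]]

    def __len__(self):
        return len(self._skills)

# Hypothetical skills; the code strings are placeholders.
library = SkillLibrary()
library.add("mine iron ore", "def mine_iron(): ...", verified=True)
library.add("craft wooden pickaxe", "def craft_pickaxe(): ...", verified=True)
library.add("fight zombie", "def fight(): ...", verified=False)  # failed in the environment

best = library.retrieve("craft stone pickaxe")
print(best[0][0])
```

Because skills live outside the model's weights, adding a new one cannot overwrite an old one, and the `verified` gate is where the external anchor enters the loop.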
The thing you might not have expected: the binding constraint is often the *curator's imagination*, not the model's capacity. Agents trained on static expert demonstrations are capped by what the curators imagined and can't learn from their own failures because they never interact with an environment (see "Can agents learn beyond what their training data shows?"). A self-evolving curriculum's whole value is that it removes the human imagination ceiling: it generates tasks the curator never thought to write down. And it has to do so carefully, because training order mechanically reshapes entropy. Structured tasks drain output entropy while open-ended ones raise it, so scheduling structured tasks first yields measurable gains (6.2% over joint training in the cited result) and protects creative capability from collapse (see "Does training order reshape how models handle different task types?"). That this scales is no longer hypothetical: RL now works in long-horizon, multi-turn settings with delayed rewards, nearly doubling SWE-bench Verified performance from 20% to 39% (see "Can reinforcement learning scale beyond single-turn language tasks?"), exactly the stateful environments where an evolving curriculum has room to run.
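As a rough illustration of entropy-aware ordering, the sketch below measures the mean Shannon entropy of sampled next-token distributions per domain and schedules low-entropy (structured) domains first. This is a deliberate simplification: the cited work uses BWT-guided scheduling, which is not reproduced here, and using current output entropy as the ordering key is an assumption of this example. The domain names and distributions are invented.

```python
import math

def shannon_entropy(probs):
    """Entropy in nats of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average entropy over several sampled distributions."""
    return sum(shannon_entropy(d) for d in distributions) / len(distributions)

def structured_first(domains):
    """Order domain names by ascending mean output entropy."""
    return sorted(domains, key=lambda name: mean_entropy(domains[name]))

# Hypothetical next-token distributions sampled from each domain.
domains = {
    "creative_writing": [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]],
    "math": [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]],
    "code": [[0.7, 0.2, 0.05, 0.05], [0.8, 0.1, 0.05, 0.05]],
}

order = structured_first(domains)
print(order)
```

A self-evolving curriculum could track the same statistic during training as a warning signal: a sharp entropy drop in an open-ended domain suggests the structured tasks are starting to crowd it out.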
So the honest synthesis: a self-evolving curriculum doesn't magically add capability a model could never represent. What it does is keep RL from collapsing onto the base distribution by continuously supplying frontier-pushing tasks and an external grading signal — the two ingredients the 'RL only redeploys' results were missing and the boundary-expansion results happened to include.
Sources (11 notes)
Does RLVR actually expand what models can reason about? Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
How does RL training reshape reasoning and what gets lost? Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
Does RL post-training create reasoning or just deploy it? Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Can reinforcement learning discover reasoning strategies base models cannot? RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
Does RL training collapse format diversity in pretrained models? Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Can models reliably improve themselves without external feedback? Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
What stops large language models from improving themselves? Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Can agents learn new skills without forgetting old ones? VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
Can agents learn beyond what their training data shows? Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Does training order reshape how models handle different task types? Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling, which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Can reinforcement learning scale beyond single-turn language tasks? Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.