How does a challenger's escalating difficulty function as curriculum?
This explores the self-play setup where one model role (a 'Challenger') invents progressively harder problems so its opponent learns by climbing a difficulty ladder — and what the corpus knows about why escalating difficulty teaches.
A self-improving system can manufacture its own curriculum by having one component invent ever-harder problems for another to solve. The clearest example in the collection is a three-role self-play loop where a Challenger escalates difficulty as curriculum, a Judge hands out binary verdicts as reward, and both sides rewrite their skills in natural language as they go (Can language models learn skills without human supervision?). The striking part is what makes it work: the system has to balance the Challenger's adversarial pressure against a generalization safeguard, because a challenger that ramps difficulty too fast doesn't teach; it collapses the learner.
That collapse risk is exactly where the rest of the corpus sharpens the picture. Difficulty is not a monotonic dial where harder is always better. RLVR gains follow an inverted-U: medium-difficulty problems teach the most because they mix enough successes to give signal with enough failures to be informative, while easy problems lack variance and hard ones just amplify shortcuts (Why do medium-difficulty problems teach reasoning better than hard ones?). So a good Challenger isn't one that maximizes difficulty; it's one that keeps the learner hovering in that productive middle band, raising the floor only as the learner can absorb it. There's even mechanistic evidence for why the band matters: easy problems reinforce answer shortcuts and suppress deliberation, hard problems activate real reasoning only on rare wins, and medium difficulty strengthens both at once, meaning identical accuracy gains can hide opposite internal changes (What reasoning features does each difficulty level reinforce?).
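The middle-band idea can be sketched in a few lines. For a binary pass/fail reward with success probability p, the reward variance is p(1 - p), which is zero at p = 0 and p = 1 and peaks at p = 0.5. This is one simple reading of why all-success or all-failure batches carry little signal, not a claim about the mechanism the cited notes establish. The controller below is a hypothetical illustration: the names `reward_variance` and `next_difficulty`, the integer difficulty scale, and the 0.3 to 0.7 band are all assumptions chosen for the example, not values from the corpus.

```python
def reward_variance(p: float) -> float:
    """Variance of a binary (pass/fail) reward with success probability p."""
    return p * (1.0 - p)


def next_difficulty(
    difficulty: int,
    success_rate: float,
    low: float = 0.3,          # assumed lower edge of the productive band
    high: float = 0.7,         # assumed upper edge of the productive band
    max_difficulty: int = 10,  # assumed top of a hypothetical difficulty scale
) -> int:
    """Move difficulty one rung at a time to keep the learner inside the band.

    Raise the level only when the learner succeeds too often, lower it when
    the learner mostly fails, and otherwise hold the current level.
    """
    if success_rate > high:
        return min(difficulty + 1, max_difficulty)
    if success_rate < low:
        return max(difficulty - 1, 0)
    return difficulty
```

Under these assumed thresholds, a learner passing 9 of 10 problems at level 3 would be moved to level 4, while one passing 1 of 10 would drop back to level 2. Stepping one rung at a time is the design choice that mirrors "raising the floor only as the learner can absorb it".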
A second framing in the corpus treats curriculum as something you can engineer from the answer backward rather than from a challenger forward. Reverse curriculum learning slides the reasoning start state backward from near-completion, so the learner first masters the last step, then the last two, and so on, recovering step-level supervision from nothing but outcome rewards (Can curriculum learning approximate expensive process supervision?). An escalating Challenger and a reverse curriculum are two routes to the same destination: both shape the sequence of problems so the gradient is always informative, one by generating harder tasks, the other by exposing more of each task.
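The backward-sliding start state can be illustrated as follows. This is a minimal sketch, assuming a correct worked solution split into steps is available to supply the prefixes; the function name `reverse_curriculum` and the newline-joined prompt format are hypothetical. Each stage hands the learner a shorter prefix, so only the final outcome needs to be checked, while the stage index still indicates how many steps the learner had to produce on its own.

```python
from typing import List, Tuple


def reverse_curriculum(question: str, steps: List[str]) -> List[Tuple[str, int]]:
    """Build (prompt, steps_remaining) pairs ordered from easiest to hardest.

    Stage k gives the learner the first len(steps) - k solution steps as a
    prefix, so it must generate the last k steps itself. The final stage
    provides no prefix at all, which is the original task.
    """
    stages = []
    for k in range(1, len(steps) + 1):
        prefix = steps[: len(steps) - k]          # steps handed to the learner
        prompt = "\n".join([question, *prefix])   # question plus given prefix
        stages.append((prompt, k))
    return stages


if __name__ == "__main__":
    demo = reverse_curriculum("Q", ["step 1", "step 2", "step 3"])
    for prompt, remaining in demo:
        print(remaining, repr(prompt))
```

If the learner's outcome reward drops sharply when moving from stage k to stage k + 1, the newly exposed step is a likely failure point, which is how outcome-only feedback can approximate step-level supervision in this setup.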
The corpus also suggests curriculum is as much about ordering whole training phases as about ordering individual problems. Running imitation first to build reasonable rollouts, then verifiable-reward RL to sharpen them, beats either alone, because the early phase is what makes the later rewards informative at all (Does sequencing imitation then exploration training improve reasoning?). Step-wise expert-similarity rewards work best precisely as that curriculum foundation before outcome-based refinement (Can step-wise expert rewards help small models learn hard reasoning?). Even task scheduling across domains follows the same logic: training structured tasks before open-ended ones prevents entropy collapse from damaging creative capability (Does training order reshape how models handle different task types?).
The quiet lesson across all of these is that escalating difficulty only functions as curriculum when each new rung sits just past the learner's current frontier, not beyond it. The collection makes that frontier concrete from the other direction too: even objectively better, harder material from a teacher degrades a student when it exceeds what the student can currently learn, so the student has to filter for compatibility (Does teacher-refined data always improve student model performance?). A Challenger that ignores that frontier isn't a curriculum; it's just noise with a difficulty label.
Sources (8 notes)
Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.
RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.
Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems activate reasoning features only on rare success; medium difficulty strengthens both simultaneously. Identical accuracy gains can reflect opposite internal changes.
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.