INQUIRING LINE

How do developmental curriculums emerge from learning progress signals?

This explores how a training 'curriculum' — an ordered progression from easier to harder learning — can arise out of signals about where a model is currently succeeding or failing, rather than being hand-designed in advance.


The corpus offers a few distinct mechanisms by which a curriculum can emerge from progress signals, and they're more interesting together than apart.

The most literal answer is reverse curriculum. In "Can curriculum learning approximate expensive process supervision?", R3 starts the model near the end of a solved problem and slides the starting point backward as it succeeds, so difficulty ramps automatically with mastery. The progress signal is just outcome reward, but because the start state moves, that single coarse signal exposes step-level failure modes, approximating the granularity of process supervision without expensive human step annotations. The curriculum isn't a syllabus; it's a moving boundary between what the model can already do and what it can't quite reach yet.
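The sliding start state can be sketched in a few lines. This is a toy illustration under stated assumptions, not the R3 implementation: `reverse_curriculum`, `ToyPolicy`, the `trials` and `pass_rate` parameters, and the rule that the start moves back one step per mastered stage are all hypothetical choices made for the example.

```python
from typing import Callable, List, Sequence


def reverse_curriculum(
    demo: Sequence[str],
    rollout: Callable[[Sequence[str]], float],
    update: Callable[[Sequence[str], List[float]], None],
    trials: int = 4,
    pass_rate: float = 0.75,
    max_rounds: int = 100,
) -> List[int]:
    """Slide the start point backward through a solved demonstration.

    `rollout(prefix)` stands in for running the policy from `prefix` and
    returning an outcome reward (1.0 if the final answer is correct).
    `update(prefix, rewards)` stands in for a training step.
    Returns the sequence of start indices visited.
    """
    start = len(demo) - 1  # easiest stage: only the last step is left to produce
    history = [start]
    for _ in range(max_rounds):
        prefix = demo[:start]
        rewards = [rollout(prefix) for _ in range(trials)]
        update(prefix, rewards)
        if sum(rewards) / trials >= pass_rate:
            if start == 0:
                break  # the whole problem is solved from scratch
            start -= 1  # mastered this stage: hand the model one more step
            history.append(start)
        # otherwise stay at the same start state and keep training
    return history


class ToyPolicy:
    """Hypothetical stand-in for a learner: skill grows with practice."""

    def __init__(self, total_steps: int) -> None:
        self.total_steps = total_steps
        self.practice = 0

    @property
    def skill(self) -> int:
        # Number of remaining steps the policy can complete on its own.
        return 1 + self.practice // 2

    def rollout(self, prefix: Sequence[str]) -> float:
        remaining = self.total_steps - len(prefix)
        return 1.0 if remaining <= self.skill else 0.0

    def update(self, prefix: Sequence[str], rewards: List[float]) -> None:
        self.practice += 1  # toy rule, not a real gradient step


demo = ["s1", "s2", "s3", "s4", "s5"]
policy = ToyPolicy(total_steps=len(demo))
history = reverse_curriculum(demo, policy.rollout, policy.update)
```

The point of the sketch is that the only feedback is the outcome reward, yet the stage at which the loop stalls localizes which step the policy cannot yet produce.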

But curriculums also emerge whether or not anyone designs them. "Does RL training follow a predictable two-phase learning sequence?" finds that, across eight models, RL training reliably moves through two phases on its own: first execution correctness drives gains, then strategic planning becomes the bottleneck. The shift is visible because planning-token entropy keeps rising while execution entropy settles. That's a curriculum the training dynamics generate spontaneously, with the shifting bottleneck acting as the progress signal that tells you which skill to invest in next. "Does sequencing imitation then exploration training improve reasoning?" shows the designed version of the same logic: imitation first to create reasonable rollouts, then verifiable rewards to sharpen them, because outcome rewards only become informative once the model is producing attempts good enough to be told apart. Ordering is what makes the signal legible.
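One way to watch for the two-phase shift is to track mean token entropy separately for planning and execution tokens. The sketch below assumes each generated token already carries a role label and that the next-token distribution at that position is available; how tokens get labeled is not specified in the note, so the `roles` input and both function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_role(
    distributions: Sequence[Sequence[float]],
    roles: Sequence[str],
) -> Dict[str, float]:
    """Average per-token entropy for each role label (e.g. planning, execution)."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for probs, role in zip(distributions, roles):
        totals[role] = totals.get(role, 0.0) + shannon_entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}


# Toy distributions over a 4-token vocabulary (made-up numbers).
distributions = [
    [0.25, 0.25, 0.25, 0.25],  # planning token: model is undecided
    [0.97, 0.01, 0.01, 0.01],  # execution token: model is nearly certain
    [0.50, 0.50, 0.00, 0.00],  # planning token
    [1.00, 0.00, 0.00, 0.00],  # execution token
]
roles = ["planning", "execution", "planning", "execution"]
by_role = mean_entropy_by_role(distributions, roles)
```

Logged per checkpoint, a planning average that keeps rising while the execution average flattens would match the pattern the note describes.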

There's a sharp limit lurking here, though. "Can models reliably improve themselves without external feedback?" argues that progress signals generated purely from within a model eventually stall, through the generation-verification gap, diversity collapse, and reward hacking, and that every reliable method secretly imports an external anchor (a past checkpoint, a judge, a tool, a user correction). So a self-emergent curriculum needs an outside reference point to keep measuring 'progress' against, or it congratulates itself into a corner. "Should successful and failed episodes be processed differently?" sharpens what the signal should carry: treat successes as concrete demonstrations and failures as abstracted lessons, an asymmetry that mirrors how human experts compress experience.

The quiet twist is what the curriculum is actually teaching. "What does reward learning actually do to model reasoning?" finds that reward-based training mostly activates strategies already latent from pretraining rather than installing new ones: for models with appropriate pretraining, a single example, or even a spurious reward, can trigger it. Read alongside the others, an emergent curriculum may be less a ladder of new skills than a search procedure for surfacing capabilities the model already has, in the right order, with progress signals telling it which latent ability to switch on next.


Sources (6 notes)

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
