Reinforcement Learning for LLMs

Can curriculum learning approximate expensive process supervision?

Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, while paying only the cost of outcome supervision?

Note · 2026-02-22 · sourced from Reasoning Architectures

The core challenge of applying RL to complex reasoning: how do you provide meaningful supervision when the reasoning chain is long, errors compound across steps, and step-level annotation is expensive? R3 (Reverse Curriculum Reinforcement Learning) solves this without human-annotated process supervision.

The mechanism: Instead of having the model reason from scratch (leading to sparse rewards and an exponential search space), R3 starts each rollout from a state sampled near the end of a correct demonstration. Most of the chain is already supplied by the demonstration prefix; the model only needs to generate the final few steps. Outcome supervision (correct or not) then provides informative feedback because the success probability is high.
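A minimal sketch of this start-state construction, assuming the demonstration is stored as a list of reasoning steps. The names (`Demo`, `make_rollout_prompt`) are illustrative, not from the R3 paper:

```python
# Hedged sketch: an R3-style rollout begins from a prefix of a correct
# demonstration rather than from the bare question.

from dataclasses import dataclass

@dataclass
class Demo:
    question: str
    steps: list[str]  # a known-correct reasoning chain, one step per entry

def make_rollout_prompt(demo: Demo, start_idx: int) -> str:
    """Return question + steps[0:start_idx] as the RL start state.

    start_idx == len(demo.steps) - 1 leaves only the last step to generate;
    start_idx == 0 is ordinary from-scratch reasoning.
    """
    prefix = "\n".join(demo.steps[:start_idx])
    return f"{demo.question}\n{prefix}" if prefix else demo.question

demo = Demo("Compute 3 * (4 + 5).", ["4 + 5 = 9", "3 * 9 = 27", "Answer: 27"])
print(make_rollout_prompt(demo, start_idx=2))
# The policy only has to produce the final step from here, so an
# outcome reward (answer correct or not) is frequently non-zero.
```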

The start state then progressively slides backward toward the beginning of the demonstration. At each curriculum stage, the model is reasonably likely to succeed (it has already learned to solve everything downstream of the new start point), and failure is informative (the model is known to be competent on the later steps). This creates a curriculum of gradually increasing exploration difficulty.
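The sliding schedule can be simulated in a few lines. This is a toy model under stated assumptions: `policy_success_rate` is a stub standing in for real RL rollouts, and the advance-on-success rule (move the start point one step earlier once the policy is reliable) is one plausible scheduling choice, not necessarily the paper's exact criterion:

```python
# Toy simulation of the reverse-curriculum schedule with outcome-only reward.

def policy_success_rate(start_offset: int, skill: int) -> float:
    # Stub: the policy succeeds from offsets it has already mastered.
    return 1.0 if start_offset <= skill else 0.2

def run_curriculum(chain_len: int, rounds: int) -> list[int]:
    offsets = []            # start offset used at each training round
    offset, skill = 1, 1    # offset 1 = one step before the demo's end
    for _ in range(rounds):
        offsets.append(offset)
        if policy_success_rate(offset, skill) >= 0.8:
            skill = max(skill, offset)           # this stage is consolidated
            offset = min(offset + 1, chain_len)  # slide one step earlier
        else:
            skill += 1      # stub: extra training closes the gap
    return offsets

print(run_curriculum(chain_len=5, rounds=6))  # → [1, 2, 2, 3, 3, 4]
```

The key property the simulation shows: the start point only moves once the suffix is reliably solved, so the reward stays dense at every stage.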

Why this approximates process supervision: Each position in the sliding curriculum implicitly tests the model on that specific step's difficulty. A model that succeeds at start position k but fails at start position k-1 has revealed that step k-1 is where its reasoning breaks down — even though only outcome supervision is used. The curriculum resolution increases with the granularity of start positions sampled.
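The localization logic above can be sketched as a backward scan over start positions. `succeeds_from` is a hypothetical evaluator (e.g., success rate over rollouts exceeding a threshold), not an API from the paper:

```python
# Sketch: outcome-only signals localize the failing step by scanning start
# positions from late to early and reporting the first failure.

from typing import Callable, Optional

def locate_breakdown(chain_len: int,
                     succeeds_from: Callable[[int], bool]) -> Optional[int]:
    """Return the earliest step index the policy cannot recover from,
    or None if it solves the task even from scratch (index 0)."""
    for start in range(chain_len - 1, -1, -1):
        if not succeeds_from(start):
            # Success at start+1 but failure here: step `start`
            # is where the reasoning breaks down.
            return start
    return None

# Toy evaluator: the policy handles steps 3 and 4 but fails from step 2.
print(locate_breakdown(5, lambda s: s >= 3))  # → 2
```

Finer-grained start-position sampling sharpens this localization, which is exactly the "curriculum resolution" point above.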

The two-mode comparison: from-scratch RL hands the model the full exponential search space with an outcome reward that is almost always zero early in training; the reverse curriculum keeps the same outcome reward but shrinks the effective exploration horizon to the unsolved suffix, so the reward signal stays dense throughout training.

This is a practical solution to the trade-off documented in "Why do outcome-based reward models fail at intermediate step evaluation?".


Source: Reasoning Architectures
