How does sliding the start state backward create informative learning signals?
This explores the R3-style trick of starting a model's reasoning near the finish line and then walking the start point backward — and why that simple move turns plain right/wrong feedback into step-by-step learning signal.
Sliding a reasoning task's start state backward, beginning near the answer and then progressively earlier, manufactures fine-grained learning signal from coarse outcome feedback. The core idea comes from reverse curriculum RL (see "Can curriculum learning approximate expensive process supervision?"): when a model starts almost at the solution, success or failure isolates the last step; slide the start back one notch and the next-to-last step becomes the thing being tested. Each backward shift converts a single right/wrong outcome into a probe of one specific step, so a stream of cheap outcome rewards ends up approximating the step-level information that expensive human process annotations would provide, without anyone labeling intermediate steps.
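The backward slide can be sketched as a way of building prompts. The snippet below is a minimal illustration, assuming a gold step-by-step demonstration is available for each problem. The function name `reverse_curriculum_starts` and the toy arithmetic problem are hypothetical and are not taken from R3's code. Each start state hands the model a prefix of the gold steps and leaves the rest for it to generate, so only the final outcome needs to be scored.

```python
from typing import List, Tuple


def reverse_curriculum_starts(question: str, steps: List[str]) -> List[Tuple[int, str]]:
    """Build start states from a gold demonstration, easiest first.

    Returns (steps_remaining, prompt) pairs. The first pair gives the model
    every gold step except the last; the final pair gives only the question.
    """
    starts = []
    # k is the number of gold steps handed to the model as a prefix.
    for k in range(len(steps) - 1, -1, -1):
        prompt = "\n".join([question, *steps[:k]])
        starts.append((len(steps) - k, prompt))
    return starts


# Toy demonstration (illustrative only).
demo_steps = ["48 / 2 = 24", "24 + 6 = 30", "30 * 3 = 90"]
starts = reverse_curriculum_starts("Compute ((48 / 2) + 6) * 3.", demo_steps)

for remaining, prompt in starts:
    print(f"--- {remaining} step(s) left for the model ---")
    print(prompt)
```

With one step remaining, a wrong final answer can only be blamed on that last step. With two remaining, a new failure points at the next-to-last step, and so on.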
Why this produces *informative* signal rather than noise becomes clearer alongside work on RL's learning dynamics. Training tends to move through two phases: first nailing execution correctness, then wrestling with strategic planning (see "Does RL training follow a predictable two-phase learning sequence?"). A reverse curriculum is essentially a way to schedule that progression on purpose: near-completion states stress execution, and as the start slides back, more planning has to be reconstructed, so the difficulty rises where the model is ready to learn next. It's a curriculum that keeps each new state just beyond what's already mastered.
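One simple way to keep each new state just past what is mastered is to slide the start back only once recent attempts from the current start mostly succeed. The rule below is an assumption made for illustration, not the schedule described in the source. The function name `next_start_index` and the 0.8 threshold are hypothetical.

```python
from typing import List


def next_start_index(current: int, recent_outcomes: List[bool], threshold: float = 0.8) -> int:
    """Pick the start state to train on next.

    current: number of gold steps currently given to the model as a prefix.
    recent_outcomes: right/wrong results of recent attempts from that start.
    Moves one step earlier when the success rate reaches the threshold.
    """
    if not recent_outcomes:
        return current  # no evidence yet, stay put
    success_rate = sum(recent_outcomes) / len(recent_outcomes)
    if success_rate >= threshold and current > 0:
        return current - 1  # hand over one fewer step, so more must be reconstructed
    return current


# Nine successes out of ten from a start with 3 given steps.
print(next_start_index(3, [True] * 9 + [False]))
```

A fixed threshold is the crudest choice. Mixing several start states in each batch is another option that avoids training on a single difficulty level at a time.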
The backward framing also shows up as a learning principle in its own right. Training models to reason *backward*, by generating the question from the answer or checking the inverse relationship, measurably sharpens their forward reasoning (see "Can backward reasoning during training improve forward reasoning?"). Both tricks exploit the same asymmetry: the endpoint of a solution is information-rich and easy to anchor on, so reasoning that radiates outward from it is better supervised than reasoning that gropes forward from a blank start.
There's a deeper reason backward states are informative, visible in research on agents learning from their own action consequences. When an agent treats the future states its actions lead to as supervision, it can learn without external rewards at all (see "Can agents learn from their own actions without external rewards?"). Sliding the start state backward is a controlled version of this: you're handing the model trajectories whose outcomes it can already partly see, so the consequence of each early choice is legible instead of buried under a long horizon of later decisions. The signal is informative precisely because the answer's proximity makes credit assignment easy.
Worth knowing: the informativeness isn't only about which steps succeed. Work on negative reinforcement finds that suppressing wrong trajectories alone often matches full RL while preserving diversity (see negative-reinforcement-alone-matches-or-exceeds-full-rl-by-suppressing-incorrec). A reverse curriculum that surfaces step-level failure modes is, in effect, a machine for generating well-localized negative examples: it tells you not just that an attempt failed, but where, which is the whole game in turning outcome feedback into something that teaches.
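The "where" can be read off outcome rewards alone by comparing success rates across adjacent start states. The sketch below is a hypothetical diagnostic, not a procedure from the source. It assumes success rates have already been measured for each start index (the number of gold steps given as a prefix), and the numbers in the example are made up for illustration.

```python
from typing import Dict


def localize_failure(success_by_start: Dict[int, float]) -> int:
    """Return the 0-based index of the step the model most likely gets wrong.

    success_by_start[k] is the outcome success rate when the first k gold
    steps are given. Moving the start from k to k - 1 forces the model to
    produce step k - 1 itself, so the drop between the two rates is
    attributed to that step.
    """
    drops = {}
    for k in sorted(success_by_start):
        if k - 1 in success_by_start:
            drops[k - 1] = success_by_start[k] - success_by_start[k - 1]
    if not drops:
        raise ValueError("need success rates for at least two adjacent start states")
    return max(drops, key=drops.get)


# Made-up rates for a 3-step problem: success collapses once the model
# has to produce step 1 on its own.
rates = {3: 0.95, 2: 0.90, 1: 0.40, 0: 0.35}
weak_step = localize_failure(rates)
print(f"largest drop is attributed to step {weak_step}")
```

Failed rollouts that begin just before the flagged step are the well-localized negative examples the paragraph above describes.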
Sources 5 notes
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Training models simultaneously on forward reasoning, backward question generation, and backward reasoning improves forward-only performance by 13.53% average across 12 datasets. The mechanism: generating backward questions forces models to understand the inverse relationship between problem and solution, deepening understanding that transfers to forward reasoning without test-time overhead.
Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.