Can environment feedback replace scalar rewards in policy learning?
Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.
The central limitation of RLVR (reinforcement learning with verifiable rewards) is information-theoretic: the reward is a single scalar per rollout. In many real verifiable settings the environment produces a far richer signal: runtime errors, failing unit tests, judge evaluations, compile traces. RLVR collapses all of this into one number. That scalar bottleneck is the source of the credit-assignment problem: which tokens caused the failure? The reward alone cannot say.
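A minimal sketch of the bottleneck, to make the information loss concrete. The `Rollout` record, the `scalar_reward` function, and the two example failures are hypothetical illustrations, not taken from the paper: two attempts fail for different reasons, yet both map to the same reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Hypothetical record of one attempt and what the environment reported."""
    completion: str
    feedback: str  # text the environment actually produced
    passed: bool


def scalar_reward(rollout: Rollout) -> float:
    """RLVR-style reward: one number per rollout; the feedback text is discarded."""
    return 1.0 if rollout.passed else 0.0


rollouts = [
    Rollout(
        "def mean(xs): return sum(xs) / len(xs)",
        "ZeroDivisionError: division by zero on input []",
        False,
    ),
    Rollout(
        "def mean(xs): return sum(xs) // max(len(xs), 1)",
        "AssertionError: mean([1, 2]) returned 1, expected 1.5",
        False,
    ),
]

rewards = [scalar_reward(r) for r in rollouts]
distinct_feedback = {r.feedback for r in rollouts}

# Same reward for both rollouts, although the environment said two different things.
print(rewards, len(distinct_feedback))
```

The feedback strings point at different lines of reasoning (an unhandled empty input versus integer division), which is exactly the information a per-rollout scalar cannot carry.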
Self-Distillation Policy Optimization (SDPO, 2601.20802) introduces a different paradigm: Reinforcement Learning with Rich Feedback (RLRF). Tokenized environment feedback is the supervision signal. The conversion mechanism is elegant: the current policy conditioned on the feedback serves as the self-teacher. Its next-token distribution is what the policy "would have generated" had it known the feedback in advance. SDPO distills this feedback-informed distribution back into the unconditioned policy.
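The distillation step can be sketched as a per-token divergence between two views of the same policy. This is a toy illustration under stated assumptions, not the paper's implementation: the function names, the KL direction (teacher to student), the uniform averaging over positions, and the hand-written logits are all assumptions made here. In a real setup the two logit sequences would come from running the same model on the prompt alone and on the prompt plus feedback.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over one completion.

    student_logits[t]: policy logits at position t given the prompt only.
    teacher_logits[t]: the same policy's logits at position t given
                       prompt + feedback, treated here as a fixed target.
    Returns (mean loss, list of per-token losses).
    """
    per_token = [
        kl(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token), per_token


# Toy 3-token vocabulary, 2 positions (made-up numbers).
student_logits = [[2.0, 0.5, 0.1], [2.0, 0.5, 0.1]]
# Position 0: feedback changes nothing. Position 1: feedback shifts mass to token 1.
teacher_logits = [[2.0, 0.5, 0.1], [0.1, 2.5, 0.1]]

loss, per_token = self_distillation_loss(student_logits, teacher_logits)
print(per_token)
```

The per-token list is the point: positions where feedback does not change the model's mind contribute nothing, while positions where it does carry the whole loss. That is a dense signal recovered from the feedback text rather than from a scalar.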
The trick is that no external teacher is required. Distillation usually requires a stronger teacher model. SDPO leverages a different fact: the same model, when given retrospective evidence of its mistakes in-context, can identify what it should have done. Given rich feedback, the model implicitly acts as a process reward model through retrospection. The student is bootstrapped by repeatedly imitating an improved version of itself, where "improved" means "conditioned on richer information."
The mechanism connects directly to the note "Can agents learn from failure without updating their weights?". Reflexion converts environment feedback into stored verbal reflections used at the next rollout. SDPO converts environment feedback into gradient-distilled improvements to the policy weights. Both reject the scalar reward as load-bearing; both treat environment signal as already containing the teaching. SDPO is the parameter-updating analog of Reflexion's memory-updating mechanism.
A second connection is structural: this is in-context learning used as supervision. Since the model can integrate feedback in-context, the gap between its with-feedback and without-feedback next-token distributions is itself the training signal. The policy doesn't need to discover what to do; it needs to internalize what its with-feedback self already knows.
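One way to write this down, as a sketch rather than the paper's stated objective. The symbols are introduced here for illustration: prompt \(x\), environment feedback \(f\), sampled completion \(y = (y_1, \dots, y_T)\), policy \(\pi_\theta\), and \(\operatorname{sg}[\cdot]\) for a stop-gradient. The KL direction and the uniform average over positions are assumptions.

```latex
% Per-token log-ratio: how much more (or less) likely the sampled token
% becomes once the same policy has seen the feedback.
\[
\Delta_t \;=\; \log \pi_\theta\!\left(y_t \mid x, f, y_{<t}\right)
        \;-\; \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)
\]

% A distillation loss that pulls the unconditioned policy toward its
% feedback-conditioned self, with the teacher side held fixed.
\[
\mathcal{L}(\theta) \;=\; \frac{1}{T} \sum_{t=1}^{T}
  \mathrm{KL}\!\left(
    \operatorname{sg}\!\left[\pi_\theta(\cdot \mid x, f, y_{<t})\right]
    \,\Big\|\,
    \pi_\theta(\cdot \mid x, y_{<t})
  \right)
\]
```

Under this reading, a strongly negative \(\Delta_t\) marks a token the feedback-informed model would not have produced, so blame lands on specific positions instead of being spread evenly over the rollout.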
The implication for the broader RL landscape: each language model is implicitly a process reward model (PRM) through retrospection. A separate reward model is not load-bearing when rich tokenized feedback is available.
Related concepts in this collection
- Can agents learn from failure without updating their weights?
  Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
  Reflexion is the memory-update analog of SDPO's gradient-update mechanism; both leverage in-context retrospection.
- Can natural language feedback overcome numerical reward plateaus?
  Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
  Critique-GRPO uses natural language feedback (NLF) as an additional learning signal alongside scalar rewards; SDPO goes further by making feedback the only signal.
- Can generative reasoning beat discriminative models with less training data?
  Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.
  SDPO's claim that "each language model is implicitly a PRM through retrospection" provides the mechanism for why generative PRMs work: they exploit the same retrospection capability.
- Can reward models learn by comparing policies instead of judging them?
  What if reward models worked as policy discriminators, measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
  Both bypass labeled-preference RMs but via different mechanisms (similarity-to-target vs feedback-conditioned self-teacher).
Original note title
rich tokenized environment feedback can be converted to dense credit assignment via self-distillation — the policy conditioned on feedback is its own teacher