Can environment feedback replace scalar rewards in policy learning?

Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.

Note · 2026-05-18 · sourced from Reinforcement Learning
What actually changes inside a model during RL training? How well do reward models actually evaluate AI reasoning?

The central limitation of reinforcement learning with verifiable rewards (RLVR) is information-theoretic. The reward is a single scalar per rollout, yet in many real verifiable settings the environment produces far richer signal: runtime errors, failing unit tests, judge evaluations, compile traces. RLVR collapses all of this into one number. That scalar bottleneck is the source of the credit-assignment problem: the reward alone cannot say which tokens caused the failure.
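A minimal sketch of the bottleneck, assuming a code-execution environment. The `RolloutFeedback` container and both helper functions are hypothetical names introduced for illustration; they do not come from the SDPO paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RolloutFeedback:
    """Hypothetical container for what a code-execution environment returns."""
    tests_passed: int
    tests_total: int
    failing_tests: List[str] = field(default_factory=list)
    runtime_error: str = ""


def scalar_reward(fb: RolloutFeedback) -> float:
    """RLVR-style collapse (assumed binary): 1.0 only if every test passes."""
    return 1.0 if fb.tests_passed == fb.tests_total else 0.0


def tokenized_feedback(fb: RolloutFeedback) -> str:
    """Rich alternative: keep the feedback as text the policy can read."""
    lines = [f"{fb.tests_passed}/{fb.tests_total} tests passed."]
    lines += [f"FAILED: {name}" for name in fb.failing_tests]
    if fb.runtime_error:
        lines.append(f"ERROR: {fb.runtime_error}")
    return "\n".join(lines)


# Two very different failures...
near_miss = RolloutFeedback(3, 4, ["test_empty_input"],
                            "IndexError: list index out of range")
total_miss = RolloutFeedback(0, 4, ["test_a", "test_b", "test_c", "test_d"],
                             "SyntaxError: invalid syntax")

# ...are indistinguishable after the scalar collapse,
# but remain distinct as tokenized feedback.
print(scalar_reward(near_miss), scalar_reward(total_miss))
print(tokenized_feedback(near_miss))
```

Both rollouts receive the same reward, so a scalar-only learner has no basis for treating the near miss differently from the total miss.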

Self-Distillation Policy Optimization (SDPO, 2601.20802) introduces a different paradigm: Reinforcement Learning with Rich Feedback (RLRF). Tokenized environment feedback is the supervision signal. The conversion mechanism is elegant: the current policy conditioned on the feedback serves as the self-teacher. Its next-token distribution is what the policy "would have generated" had it known the feedback in advance. SDPO distills this feedback-informed distribution back into the unconditioned policy.
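The distillation step can be sketched with toy numbers. This is an illustration under stated assumptions, not the paper's implementation: the logits are made up, the teacher is treated as a fixed target, and the divergence direction KL(teacher || student) is an assumed choice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token logits at three positions of one rollout (vocabulary size 3).
# student: the policy given only the prompt.
# teacher: the same policy, additionally conditioned on the environment feedback.
student_logits = [[2.0, 0.5, 0.1], [1.0, 1.0, 1.0], [0.2, 0.1, 3.0]]
teacher_logits = [[2.0, 0.5, 0.1], [0.1, 3.0, 0.2], [0.2, 0.1, 3.0]]

# One divergence per token position: a dense signal instead of one scalar.
per_token_loss = [
    kl(softmax(t), softmax(s))
    for t, s in zip(teacher_logits, student_logits)
]
loss = sum(per_token_loss) / len(per_token_loss)

# The position where feedback changed the model's mind carries all the loss.
blamed_position = max(range(len(per_token_loss)), key=per_token_loss.__getitem__)
print(per_token_loss, blamed_position)
```

In this toy case the feedback only changes the teacher's distribution at the middle position, so that position alone receives a nonzero loss. This is the sense in which the feedback-conditioned policy localizes credit that a per-rollout scalar cannot.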

The trick is that no external teacher is required. Distillation usually needs a stronger model. SDPO leverages a different fact: the same model, when given retrospective evidence of its mistakes in-context, can identify what it should have done. Given rich feedback, the model acts implicitly as a process reward model through retrospection. The student is bootstrapped by repeatedly imitating an improved version of itself, where "improved" means "conditioned on richer information."

The mechanism connects directly to the note "Can agents learn from failure without updating their weights?". Reflexion converts environment feedback into stored verbal reflections used at the next rollout. SDPO converts environment feedback into gradient-distilled improvements to the policy weights. Both reject the scalar reward as load-bearing; both treat the environment signal as already containing the teaching. SDPO is the parameter-updating analog of Reflexion's memory-updating mechanism.

A second connection is structural: this is in-context learning used as supervision. Since the model can integrate feedback in-context, the difference between the with-feedback and without-feedback distributions is itself the gradient signal. The policy does not need to discover what to do; it needs to internalize what its with-feedback self already knows.
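For a softmax policy this claim can be made concrete. With a fixed teacher distribution, the gradient of the distillation cross-entropy with respect to the student's logits is the student distribution minus the teacher distribution, a standard identity. The sketch below checks it numerically on toy logits; whether SDPO uses exactly this loss is an assumption made here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(teacher_probs, student_logits):
    """H(teacher, student), with the teacher held fixed."""
    q = softmax(student_logits)
    return -sum(p * math.log(qi) for p, qi in zip(teacher_probs, q))


student_logits = [1.0, 1.0, 1.0]          # without-feedback policy (toy values)
teacher_probs = softmax([0.1, 3.0, 0.2])  # with-feedback policy (toy values)

# Analytic gradient w.r.t. the student logits: student minus teacher.
analytic = [q - p for q, p in zip(softmax(student_logits), teacher_probs)]

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(len(student_logits)):
    up = list(student_logits)
    down = list(student_logits)
    up[i] += eps
    down[i] -= eps
    numeric.append(
        (cross_entropy(teacher_probs, up) - cross_entropy(teacher_probs, down))
        / (2 * eps)
    )

print(analytic)
print(numeric)
```

A gradient-descent step moves the student's logits toward the tokens its feedback-conditioned self prefers. Where the two distributions already agree, the gradient is zero.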

The implication for the broader RL landscape: a language model is implicitly a process reward model (PRM) through retrospection. A separate reward model is not load-bearing when rich tokenized feedback is available.


Paper: Reinforcement Learning via Self-Distillation

Original note title

rich tokenized environment feedback can be converted to dense credit assignment via self-distillation — the policy conditioned on feedback is its own teacher