Reinforcement Learning via Self-Distillation
Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context.
In this work, we argue that the key limitation is not RL per se, but the information bottleneck imposed by scalar outcome rewards. Many verifiable environments expose rich tokenized feedback beyond the scalar reward r, such as runtime errors, failing unit tests, or evaluations from an LLM judge. This feedback reveals not only whether a rollout was wrong, but also what went wrong. We formalize this more general setting as Reinforcement Learning with Rich Feedback (RLRF) and illustrate how it differs from RLVR in Figure 2. Here, feedback can be any tokenized representation of any state reached by an agentic system. The central question becomes: how can we convert rich feedback into effective credit assignment without requiring external supervision from a strong teacher?
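To make the contrast between the two settings concrete, the following minimal sketch shows a toy unit-test environment that returns both the scalar reward used in RLVR and the textual feedback available in RLRF. The names `RichFeedback` and `run_unit_tests` are hypothetical and chosen only for illustration; they are not taken from our implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class RichFeedback:
    """Hypothetical container: what a verifiable environment can return."""
    reward: float   # the scalar outcome an RLVR method would see
    feedback: str   # the textual feedback an RLRF method can also use


def run_unit_tests(
    fn: Callable[..., Any],
    cases: List[Tuple[tuple, Any]],
) -> RichFeedback:
    """Run a candidate function on (args, expected) pairs.

    Failures are collected as text, so the result says what went wrong,
    not only whether something went wrong.
    """
    failures = []
    for args, expected in cases:
        try:
            got = fn(*args)
        except Exception as exc:  # runtime errors become feedback text
            failures.append(f"{args}: raised {type(exc).__name__}: {exc}")
            continue
        if got != expected:
            failures.append(f"{args}: expected {expected!r}, got {got!r}")
    reward = 1.0 if not failures else 0.0
    return RichFeedback(reward, "\n".join(failures) or "all tests passed")


def buggy_divide(a, b):
    # Toy rollout with a bug: no handling of division by zero.
    return a // b


result = run_unit_tests(buggy_divide, [((4, 2), 2), ((1, 0), 0)])
print(result.reward)
print(result.feedback)
```

An RLVR method would observe only `result.reward`, whereas `result.feedback` names the failing input and the exception that was raised.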
Building on this idea, we introduce Self-Distillation Policy Optimization (SDPO), an on-policy algorithm that performs RL via self-distillation. SDPO samples rollouts from the current policy, obtains rich environment feedback, and then minimizes a logit-level distillation loss that matches the current policy's next-token distribution to that of the self-teacher. Conceptually, SDPO addresses the central limitation of applying distillation to online learning: the absence of a stronger external teacher. Instead of relying on a fixed teacher, SDPO leverages the model's ability to recognize its own mistakes in hindsight. By conditioning the current policy on the rich feedback it just received, we construct a self-teacher that provides the dense supervision of distillation while retaining the exploration benefits of on-policy RL.
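The distillation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes a per-token KL divergence from the self-teacher to the student, averaged over the rollout, and it assumes the teacher logits are treated as constants. The text above specifies only a logit-level distillation loss, so the divergence direction and the averaging are assumptions, and the function names are hypothetical. Logits are passed in as plain lists; in practice they would come from two forward passes of the same model, one without and one with the feedback in context.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits: List[float], student_logits: List[float]) -> float:
    """KL(teacher || student) at a single token position (assumed direction)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def self_distillation_loss(
    student_logits: List[List[float]],
    teacher_logits: List[List[float]],
) -> float:
    """Mean per-token KL over one rollout.

    student_logits[t]: policy logits at position t given the prompt and
                       the rollout prefix.
    teacher_logits[t]: logits of the same model at position t when the
                       feedback is additionally placed in context
                       (treated as a constant target here).
    """
    per_token = [token_kl(t, s) for t, s in zip(teacher_logits, student_logits)]
    return sum(per_token) / len(per_token)


# Toy rollout with two tokens and a vocabulary of three.
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, -1.0, 3.0]]  # disagrees only at position 1

print([token_kl(t, s) for t, s in zip(teacher, student)])
print(self_distillation_loss(student, teacher))
```

In the toy example, the per-token divergence is zero where the self-teacher agrees with the student and positive where the feedback changes its prediction, which illustrates how the signal becomes dense at the token level rather than a single scalar per attempt.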
Conceptually, our work is related to "Bootstrap Your Own Latent" (BYOL) and to "expert iteration", where a student is bootstrapped by repeatedly imitating an improved version of itself (called the "expert"). Canonically, the expert combines the student with test-time search, such as tree search or majority voting. In contrast, SDPO leverages the student's ability to learn from rich feedback provided in-context, which is related to "augmented views" in BYOL. Our work also relates to process reward models (PRMs). Unlike our RLRF setting, PRMs are typically trained on scalar rewards, either on value estimates for intermediate states or on outcome rewards. Unlike the self-teacher in SDPO, a PRM is a model distinct from the student, which introduces significant memory overhead. Our work shows that each language model is implicitly a PRM through retrospection if given rich feedback.
We introduced Reinforcement Learning with Rich Feedback (RLRF), a paradigm where environments provide tokenized feedback beyond scalar rewards, and argued that this removes a key information bottleneck of RLVR. We then proposed Self-Distillation Policy Optimization (SDPO), which uses the current policy as a feedback-conditioned self-teacher and distills its corrected log-probabilities into the student. This leverages the model's ability to learn from context for dense credit assignment. We also demonstrated that SDPO can be implemented as a minimal, drop-in modification to standard RLVR pipelines. SDPO enables learning from rich feedback in a way that is arguably closer to human cognition: utilizing precise outcomes rather than just binary rewards. By allowing the model to determine retrospectively how it should have acted, we demonstrate that language models can convert diverse tokenized feedback into effective self-supervision.