Self-distillation Enables Continual Learning
Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations. Code and datasets are available at http://idanshenfeld.com/SDFT.
Foundation models have achieved remarkable success in recent years, powering AI applications across language, vision, robotics, and beyond. However, despite their impressive capabilities, today’s AI systems remain static after deployment. While they can adapt their behavior at inference time through mechanisms such as retrieval or prompting, they do not update their parameters to acquire new skills, internalize new knowledge, or improve from experience. Building the next generation of foundation models therefore requires solving continual learning: allowing AI systems to keep learning and improving over time, much as humans accumulate knowledge and refine skills throughout their lives (Hassabis et al., 2017; De Lange et al., 2021).
A growing body of recent work has highlighted the importance of on-policy learning for continual learning. When models learn from data generated by their current policy, they exhibit substantially reduced catastrophic forgetting compared to off-policy alternatives (Shenfeld et al., 2025; Chen et al., 2025). To date, most successful on-policy approaches have been developed in the context of reinforcement learning (RL), where feedback is provided through an explicit reward function. However, in many real-world settings such rewards are unavailable or difficult to specify. Instead, learning typically proceeds from datasets of expert demonstrations. The dominant paradigm in this regime is supervised fine-tuning (SFT), which trains the model to imitate expert actions under a fixed, offline data distribution. While simple and scalable, SFT is inherently off-policy, and prior work has shown that sequential SFT can lead to poor generalization and severe catastrophic forgetting when models are adapted to new tasks or domains (Kirkpatrick et al., 2017; Li & Hoiem, 2017). This tension raises a fundamental challenge for continual learning: how can we obtain the benefits of on-policy learning when only demonstrations are available?
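To make the distinction concrete, the SFT objective takes its expectation over a fixed demonstration dataset, whereas on-policy methods evaluate their objective on trajectories drawn from the current policy itself (the notation below is ours, for illustration only):

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, y^{*}) \sim \mathcal{D}}\big[\log \pi_\theta(y^{*} \mid x)\big], \qquad \mathcal{L}_{\text{on-policy}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\ell(x, y)\big],$$

where $\ell$ is some per-trajectory loss (for instance a negative reward in RL, or a distillation loss as in the method introduced below). Because the SFT expectation never involves the model's own outputs, the data distribution it trains on can lie far from the model's current behavior, which is precisely the off-policy regime associated with forgetting.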
The challenges of off-policy learning can, in principle, be overcome by first learning a reward function from demonstrations (i.e., Inverse Reinforcement Learning or IRL), and then performing on-policy RL (Ng et al., 2000; Abbeel & Ng, 2004). While IRL is conceptually elegant, effectively recovering rewards typically requires strong priors over the reward structure, which has limited its practical adoption to settings where such assumptions are justified, such as RLHF (Peng et al., 2018; Stiennon et al., 2020).
Rather than inferring an explicit reward function, we propose Self-Distillation Fine-Tuning (SDFT), an on-policy distillation (Ross et al., 2011; Agarwal et al., 2024) framework for learning directly from demonstrations. SDFT relies on the observation that large pretrained models exhibit strong in-context learning—the ability to adapt their behavior when conditioned on examples, without parameter updates (Brown et al., 2020). We exploit this property by using the same model in two roles: a teacher, conditioned on both the task input and an expert demonstration, and a student, conditioned only on the task input. Training distills the teacher’s predictions into the student on trajectories generated by the student itself, yielding on-policy updates that incorporate information from demonstrations without explicit reward inference or offline imitation.
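As a concrete illustration of this two-role setup, the sketch below shows one possible SDFT-style update using Hugging Face Transformers. It is our reading of the description above, not the authors' released implementation (available at the URL in the abstract); the model name, function name, and hyperparameters are illustrative placeholders.

```python
# Minimal sketch of an on-policy self-distillation step, assuming a causal LM.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
student = AutoModelForCausalLM.from_pretrained(model_name)

def sdft_loss(prompt_ids, demo_ids, max_new_tokens=128):
    """One self-distillation step on a single example.

    prompt_ids: task input tokens (the student's only context)
    demo_ids:   expert demonstration tokens (prepended only for the teacher)
    """
    # 1) The student generates an on-policy trajectory from the task input alone.
    with torch.no_grad():
        rollout = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                   do_sample=True)
    response = rollout[:, prompt_ids.shape[1]:]  # keep only the generated tokens

    # 2) The teacher is the same model, additionally conditioned on the
    #    demonstration; its predictions on the student's trajectory are soft targets.
    teacher_input = torch.cat([demo_ids, prompt_ids, response], dim=1)
    student_input = torch.cat([prompt_ids, response], dim=1)
    n = response.shape[1]
    with torch.no_grad():
        t_logits = student(teacher_input).logits[:, -n - 1:-1]
    s_logits = student(student_input).logits[:, -n - 1:-1]

    # 3) Distill: KL(teacher || student) over the student-generated tokens.
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.log_softmax(t_logits, dim=-1),
                    log_target=True, reduction="batchmean")

# Illustrative usage on a toy example.
prompt_ids = tok("Translate to French: good morning", return_tensors="pt").input_ids
demo_ids = tok("Example: 'good night' -> 'bonne nuit'\n", return_tensors="pt").input_ids
sdft_loss(prompt_ids, demo_ids).backward()
```

Whether the teacher uses the current weights, as above, or a frozen snapshot of them is a design choice the sketch leaves open; it simply stops gradients through the demonstration-conditioned pass so that only the student's unconditioned predictions are updated.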
We evaluate SDFT in two continual learning settings: skill learning, where demonstrations are used to improve performance on a task, and knowledge acquisition, where new information must be incorporated into the model. Across both settings, SDFT provides stable on-policy updates that enable learning while substantially reducing catastrophic forgetting compared to supervised learning. Consistent with prior work on on-policy learning (Ross et al., 2011; Chu et al., 2025), SDFT also improves generalization both in-distribution and out-of-distribution, making it beneficial even in settings where retaining prior capabilities is not the primary objective. In a sequential learning experiment involving three distinct skills, SDFT enables a single model to acquire each skill in turn while preserving performance on previously learned skills as well as on unrelated, pre-existing capabilities—demonstrating that continual learning from demonstrations is possible.