Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance.
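To make the step-wise reward concrete, the following is a minimal sketch, not the reference implementation. It assumes that the monologue is wrapped in `<think>...</think>` tags and that similarity is measured with Python's `difflib.SequenceMatcher`; the tag format and the helper names `split_monologue_and_action` and `step_reward` are hypothetical and chosen for illustration only.

```python
from difflib import SequenceMatcher


def split_monologue_and_action(output: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split one rollout into (monologue, action).

    Assumption: the monologue is wrapped in <think>...</think> and the action
    is whatever follows the closing tag. If no tag is present, the whole
    output is treated as the action.
    """
    head, sep, tail = output.partition(close_tag)
    if not sep:
        return "", output.strip()
    return head.replace("<think>", "").strip(), tail.strip()


def step_reward(model_output: str, expert_action: str) -> float:
    """Return a similarity score in [0, 1] between the model's action and the expert's.

    Only the action is scored; the monologue is left unconstrained.
    """
    _, action = split_monologue_and_action(model_output)
    if not action:
        return 0.0
    return SequenceMatcher(None, action, expert_action).ratio()
```

Under these assumptions, a rollout whose action only partially matches the expert step still receives a non-zero reward, and two rollouts with different monologues but the same action are scored identically.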
The effectiveness of these outcome-based RL methods fundamentally depends on the policy model's ability to discover correct solutions within a limited rollout budget. Under practical computational constraints, however, this learning paradigm struggles on challenging training problems where the model's success rate is effectively zero, that is, where the pass@k rate remains zero even after sampling k rollouts. Such cases are increasingly common in tasks requiring complex, multi-step reasoning. For these problems, a single incorrect intermediate step can derail the entire reasoning chain of a 7B-scale LLM, so the rollout receives a negative learning signal even when parts of the solution are correct. Moreover, naively penalizing all incorrect final outputs can introduce training instability and hinder progress, making these difficult reasoning tasks largely intractable for standard outcome-based RL methods.
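As an illustration of the zero-success regime, the sketch below computes pass@k with the standard unbiased combinatorial estimator and pairs it with a binary outcome reward. The function names `pass_at_k` and `outcome_rewards` are hypothetical, and the binary 0/1 reward is an assumed simplification of a verifiable reward.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled rollouts, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n rollouts contains
    # no correct one, subtracted from 1. comb() returns 0 when n - c < k.
    return 1.0 - comb(n - c, k) / comb(n, k)


def outcome_rewards(is_correct: list[bool]) -> list[float]:
    """Assumed binary outcome reward: 1.0 for a verified final answer, else 0.0."""
    return [1.0 if ok else 0.0 for ok in is_correct]
```

When no rollout is correct (c = 0), `pass_at_k` is zero for every k and `outcome_rewards` assigns the same value to a nearly correct rollout and a completely wrong one, so the outcome reward alone cannot separate them.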
An alternative approach is imitation learning, commonly implemented via Supervised Fine-Tuning (SFT) on expert demonstrations. While SFT can instill valuable reasoning behaviors, its next-token prediction objective enforces rigid, token-level imitation, limiting the model's ability to generalize beyond the training data. This problem becomes particularly pronounced when the training data are modest in scale and the model itself is less capable. Under such conditions, long and complex demonstrations often lead to overfitting and shallow reasoning behaviors. Consequently, both SFT and outcome-based RL struggle on challenging reasoning tasks, leaving a critical gap in training small open-source models to learn difficult problems effectively.
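For reference, the next-token prediction objective mentioned above is usually written as the negative log-likelihood of the demonstration tokens. The notation here is an assumption introduced for illustration: x is the problem, y = (y_1, ..., y_T) is the expert demonstration, and \pi_\theta is the model being fine-tuned.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta\left(y_t \mid x, y_{<t}\right)
```

Every token of the demonstration contributes a term to this sum, which is the sense in which the imitation is token-level: the longer the demonstration, the more exact tokens the model is pushed to reproduce.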
Distilling reasoning into smaller models via SFT on teacher-generated long Chain-of-Thought (CoT) rationales has proven highly effective for transferring complex problem-solving skills. Research indicates that this process is surprisingly data-efficient, with small, high-quality datasets often being sufficient. Given this success, research has focused on the factors underlying effective SFT distillation. Some work emphasizes the logical structure of the reasoning trace rather than its semantic correctness, as models can learn from demonstrations that contain factual errors. However, significant challenges remain, including the student-teacher gap, where the student fails to learn from overly complex data, and the risk of teacher hacking, where the student overfits to a teacher's specific flaws. Ultimately, distillation from a teacher model imposes a performance ceiling on the student.
In conclusion, we introduced Supervised Reinforcement Learning (SRL), a novel method designed to teach LLMs complex reasoning skills from expert demonstrations, particularly for problems that are too difficult for conventional RL or SFT approaches. By breaking down expert solutions into manageable steps and leveraging a dense sequence similarity reward, SRL provides effective, granular guidance that bridges the gap between imitation learning and reinforcement learning. Our empirical results demonstrate that SRL not only significantly outperforms baseline methods in both mathematical reasoning and software engineering tasks but also enables a powerful curriculum learning strategy when combined with RLVR. This work establishes SRL as a robust and generalizable technique for unlocking a model's potential to learn from challenging, multi-step problems, paving the way for training more capable and versatile AI agents.
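The decomposition of an expert solution into step-wise training instances can be sketched as follows. This is a minimal illustration under stated assumptions: the expert solution is taken to be already segmented into a list of steps, the context for each step is the problem followed by the preceding expert steps, and the names `StepExample` and `build_step_examples` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepExample:
    prompt: str         # problem plus the expert steps taken so far
    target_action: str  # the expert's next step, used as the reward reference


def build_step_examples(problem: str, expert_steps: list[str]) -> list[StepExample]:
    """Turn one expert trajectory of N steps into N step-wise training instances."""
    examples = []
    for k, action in enumerate(expert_steps):
        # Steps before index k form the context; step k is the target.
        history = "\n".join(expert_steps[:k])
        prompt = problem if not history else f"{problem}\n{history}"
        examples.append(StepExample(prompt=prompt, target_action=action))
    return examples
```

In this sketch a single demonstration with N steps yields N prompts, each of which can be scored against its own `target_action`, so a reward is available at every step rather than only at the final answer.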