On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Paper · arXiv 2508.11408 · Published August 15, 2025
Tags: Reinforcement Learning · Training · Fine Tuning · Self Refinement · Self Consistency Feedback · Domain Specialization · RAG

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data’s influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process.

While both SFT and RL aim to refine the capabilities of LLMs, each paradigm has its own strengths and weaknesses. SFT relies on high-quality expert trajectories to teach the model to mimic expert response patterns, so its effectiveness is sensitive to the quality and quantity of expert data [15, 52, 61]. Recent studies also point out that SFT may struggle to generalize beyond mere memorization [8] and is vulnerable to exposure bias [4, 58]. In contrast, RL encourages LLMs to actively explore and reinforce potentially superior behaviors, enabling better generalization through direct feedback on their on-policy generations [6, 8]. However, such exploration can be inefficient, leading to risks such as policy degradation caused by entropy collapse or over-exploitation of suboptimal strategies [16].
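To make the contrast concrete, below is a minimal PyTorch sketch of the two objectives. The function names, tensor shapes, and the plain REINFORCE form are illustrative assumptions, not the paper's exact formulation: SFT maximizes the likelihood of static expert tokens, while on-policy RL reweights the likelihood of the model's own samples by the reward they receive.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, expert_ids: torch.Tensor) -> torch.Tensor:
    """Off-policy imitation: cross-entropy against a static expert trajectory.

    logits:     [T, V] model scores, teacher-forced on the expert response.
    expert_ids: [T] token ids of the expert response.
    """
    return F.cross_entropy(logits, expert_ids)

def reinforce_loss(sample_log_probs: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    """On-policy RL (plain REINFORCE form): scale the log-likelihood of the
    model's *own* sampled tokens by the reward the sample earned.

    sample_log_probs: [T] log pi_theta(y_t | y_<t, x) of the sampled tokens.
    reward:           scalar sequence-level reward (or advantage estimate).
    """
    return -(reward * sample_log_probs).sum()
```

The key asymmetry is the data source: `sft_loss` is evaluated on a fixed expert trajectory regardless of what the model would generate, whereas `reinforce_loss` is evaluated only on tokens the model itself sampled.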

A prevalent and straightforward approach for integrating the strengths of SFT and RL while mitigating their weaknesses is the sequential SFT-then-RL paradigm [25, 30]. Intuitively, on one hand, SFT offers reasoning patterns that can guide exploration in RL to escape local optima [16]. On the other hand, the on-policy learning mechanism in RL can reduce the exposure bias inherent in SFT and prevent overfitting to a limited set of static examples. However, empirical observations show that the SFT-then-RL paradigm does not consistently outperform the pure RL approach, as illustrated in Figure 1, which is also noted in recent studies [6, 57].

In this study, we investigate this phenomenon further and demonstrate that the suboptimal performance may arise from training on expert data that diverges significantly from the model’s own established patterns. Specifically, as shown in Figure 2, the learning curve reveals a “shift-readapt-overfit” progression consisting of three distinct phases. First, there is an initial drop in capability caused by the sudden policy shift; this is followed by a readaptation phase during which the model adapts to the expert’s patterns and recovers performance. Finally, the model eventually overfits to the expert data. These observations highlight that while expert data can introduce new capabilities, it can also disrupt established patterns and induce overfitting during training.

Drawing upon these insights, we propose a principle that views SFT and RL through a unified lens of off-policy versus on-policy learning, reframing SFT not as a separate tuning stage but as a dynamically weighted auxiliary objective within the on-policy RL process. We further design CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting. The overall architecture of CHORD is shown in Figure 3: it features a global coefficient that adjusts the overall influence of expert data throughout training (Section 3.2) and a fine-grained, per-token weighting function that maintains stability by down-weighting highly divergent expert tokens that could disrupt on-policy training (Section 3.3). Through this dynamic weighting mechanism, CHORD controls the influence of off-policy expert data while preserving training stability. Extensive experiments demonstrate that CHORD significantly outperforms the compared baselines, achieving higher performance by balancing learning from expert data against maintaining the model’s own exploration capabilities.
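As a rough illustration of the dual-control mechanism, the sketch below blends an on-policy RL loss with a token-weighted SFT term. Both the linear decay schedule for the global coefficient `mu` and the token weight `w = p * (1 - p)`, which vanishes for highly divergent (p ≈ 0) and already-mastered (p ≈ 1) expert tokens, are assumptions consistent with the description above, not the paper's exact formulas.

```python
import torch
import torch.nn.functional as F

def chord_loss(
    expert_logits: torch.Tensor,   # [T, V] policy logits, teacher-forced on expert tokens
    expert_ids: torch.Tensor,      # [T] expert token ids
    onpolicy_loss: torch.Tensor,   # scalar on-policy RL loss (e.g., from GRPO/PPO)
    mu: float,                     # global coefficient in [0, 1]
) -> torch.Tensor:
    """Blend on-policy RL with a dynamically weighted SFT auxiliary objective."""
    log_probs = F.log_softmax(expert_logits, dim=-1)
    tok_logp = log_probs.gather(-1, expert_ids.unsqueeze(-1)).squeeze(-1)  # [T]
    p = tok_logp.exp().detach()  # policy's current probability of each expert token
    # Assumed token-wise weight: down-weight expert tokens the policy finds
    # highly divergent (p ~ 0) as well as tokens it has already mastered (p ~ 1).
    w = p * (1.0 - p)
    sft_term = -(w * tok_logp).sum() / w.sum().clamp_min(1e-8)
    return (1.0 - mu) * onpolicy_loss + mu * sft_term

def mu_schedule(step: int, total_steps: int, mu_max: float = 0.9, mu_min: float = 0.05) -> float:
    """Assumed linear decay: start near imitation, end near pure exploration."""
    frac = min(step / max(total_steps, 1), 1.0)
    return mu_max + (mu_min - mu_max) * frac
```

Under these assumptions, each training step would compute the usual on-policy loss on sampled rollouts, evaluate `chord_loss` on a batch of expert trajectories with the current `mu_schedule(step, total_steps)`, and backpropagate the combined objective, so the model drifts smoothly from off-policy imitation toward on-policy exploration.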