RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following

Paper · arXiv 2510.14200 · Published October 16, 2025
Reinforcement Learning · Domain Specialization · Reward Models · Alignment · Data

After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability, and enable efficient domain adaptation with minimal data. SFT relies on the next-token prediction objective to strengthen instruction following in a base model using a large corpus of human-labeled responses. In contrast, RFT employs an RL-based approach to adapt fine-tuned reasoning models to specific domains with limited supervision. Inspired by RFT, we propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework, thereby improving the base model’s instruction-following ability. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses. RLSR can be utilized in multiple ways. It can directly replace SFT, achieving superior performance on instruction-following benchmarks; for example, RLSR (SB) on Qwen-7B (INFINITY) achieved an AlpacaEval win rate of 26.34%, surpassing SFT’s 21.01%. Moreover, combining SFT with RLSR yields additional gains on downstream tasks; Qwen-7B (INFINITY) achieved a win rate of 30.73% when trained with SFT + RLSR.

Large Language Models (LLMs) have demonstrated transformative potential across a myriad of tasks following their initial "pretraining stage". However, to become truly capable and safe assistants, they must undergo subsequent post-training to enhance their instruction-following ability, mitigate undesired responses, improve complex reasoning capability, and enable efficient domain adaptation. This typically involves techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF) [1], Reinforcement Learning with Verifiable Rewards (RLVR) [2], and Reinforcement Fine-Tuning (RFT) [3].

The prevailing standard for initial alignment is SFT. SFT leverages a dataset of high-quality, human-labeled (prompt, response) pairs, optimizing the model with a next-token prediction objective (cross-entropy loss) [4]. While effective and computationally stable, SFT is a "teacher-forcing" method that strictly enforces human-labeled generations at the token level, inherently limiting the model’s "exploratory capacity" and potentially confining it to suboptimal local optima.
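For reference, the token-level cross-entropy objective that SFT optimizes can be written as follows (a standard formulation, not an equation taken from this paper), with x denoting the prompt and y∗ the human-labeled response:

$$ \mathcal{L}_{\mathrm{SFT}}(\theta) \;=\; -\sum_{t=1}^{|y^{*}|} \log \pi_{\theta}\!\left(y^{*}_{t} \mid x,\, y^{*}_{<t}\right) $$

Because every token of y∗ is forced as the target, the model is never credited for semantically equivalent alternatives, which is the rigidity the RL-based methods discussed next try to address.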

In contrast, methods based on Reinforcement Learning (RL), such as RLHF and RLVR, have proven highly effective for complex objectives like safety and reasoning. These approaches shift the objective from token-level matching to "maximizing a reward signal", allowing for a crucial balance between exploration and exploitation. RLHF, for instance, trains a Reward Model (RM) on human preference data to guide a subsequent policy optimization step, typically with Proximal Policy Optimization (PPO) [5] or its variants like Group Relative Policy Optimization (GRPO) [2]. In RLVR, the reward model is replaced with a verifiable function to improve the model’s capability on math and coding tasks. Similarly, RFT adapts a fine-tuned reasoning model to specific domains by utilizing a programmable grader to assign rewards for nuanced objectives.

This disparity raises a critical question: Can we replace or augment SFT with a purely RL-based approach that makes full use of the existing, extensive SFT human-labeled dataset to improve the base model’s core instruction-following capability?

Inspired by the success of RL-based methods in enhancing model capabilities beyond SFT’s token-level constraints, we propose Reinforcement Learning with Supervised Reward (RLSR). Borrowing ideas from RFT, RLSR re-frames the SFT process within an RL framework. For each prompt, the base model generates multiple candidate responses. Instead of relying on a learned reward model or sparse correctness signals, RLSR computes a reward score for each candidate based on the cosine similarity in the semantic embedding space between the generated response and the human-labeled response. This reward function directly leverages the high-quality SFT data while introducing the element of exploration inherent to RL. We define the reward r(x, y, y∗) as the cosine similarity between E(x, y), the embedding of the prompt x together with the generated response y, and E(x, y∗), the embedding of the prompt together with the human-labeled response y∗ (a minimal sketch of this computation follows the list below). We demonstrate that RLSR can be utilized in two powerful ways:

  1. Direct SFT Replacement: RLSR can directly substitute the SFT phase, achieving superior performance on instruction-following benchmarks.

  2. SFT Enhancement: RLSR can be applied as a subsequent fine-tuning stage after SFT, creating an SFT + RLSR pipeline that further boosts downstream task performance.
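As referenced above, the following is a minimal sketch of the reward computation under the stated definition. The encoder choice (a small sentence-transformers model) and the helper names are illustrative assumptions, not the authors’ implementation.

```python
# Minimal sketch of the RLSR reward: cosine similarity between the embedding of the
# (prompt, generated response) pair and the (prompt, human-labeled response) pair.
# The encoder below is an illustrative assumption, not the paper's choice.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stands in for E(.)

def rlsr_rewards(prompt: str, candidates: list[str], reference: str) -> np.ndarray:
    """Return one reward per sampled candidate response."""
    ref_emb = encoder.encode(prompt + "\n" + reference)                   # E(x, y*)
    cand_embs = encoder.encode([prompt + "\n" + c for c in candidates])   # E(x, y_i)
    ref_emb = ref_emb / np.linalg.norm(ref_emb)
    cand_embs = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    return cand_embs @ ref_emb                                            # cosine similarities

# Candidates closer in meaning to the human-labeled answer receive higher reward.
rewards = rlsr_rewards(
    "Explain overfitting in one sentence.",
    ["Overfitting is when a model memorizes training noise instead of the underlying pattern.",
     "The capital of France is Paris."],
    "Overfitting happens when a model fits noise in the training data and generalizes poorly.",
)
```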

Although RLHF is effective, it demands substantial computational resources, since it requires maintaining separate reward, value, and policy networks, and it often suffers from instability during training. To alleviate these challenges, GRPO has been proposed as a simplified alternative to PPO. Unlike PPO’s actor–critic framework, which trains a value function using cumulative rewards from the RM, GRPO replaces the learned value function with the mean reward computed across multiple sampled responses. This modification eliminates the need for a separate value model, thereby reducing FLOPs, memory usage, and training complexity.
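A minimal sketch of the group-relative advantage that replaces the critic is shown below (a common formulation of GRPO; variable names are illustrative, and a small constant guards against zero variance):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Advantage of each sampled response relative to its own group:
    subtract the group mean (and, in the common variant, divide by the group std),
    so no separate value network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards for, e.g., four responses sampled for the same prompt.
adv = group_relative_advantages(np.array([0.82, 0.65, 0.91, 0.40]))
# Responses scoring above the group mean get positive advantages and are reinforced.
```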

LLMs fine-tuned with RLHF and GRPO exhibit a strong ability to suppress undesired or harmful generations. However, these methods still struggle on complex reasoning tasks such as mathematics or programming. To address this limitation, RLVR has been introduced. Unlike RLHF, which relies on a learned reward model, RLVR uses verifiable, automatically computed signals, such as the correctness of a mathematical answer or whether generated code passes test cases, as rewards. Although these rewards are sparse, they are precise and directly aligned with task success. The combination of RLVR and GRPO, as implemented in DeepSeek R1 [3], substantially improves an LLM’s reasoning ability, enabling longer and more coherent chains of thought (CoT) and fostering “aha moments,” where the model recognizes and corrects prior reasoning errors.
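The sketch below illustrates what such verifiable rewards can look like: an exact-match check for a math answer and a pass/fail test run for generated code. Both are simplified assumptions rather than any specific RLVR implementation (in particular, real systems sandbox code execution).

```python
import subprocess
import sys
import tempfile

def math_reward(generated_answer: str, reference_answer: str) -> float:
    """1.0 if the model's final answer matches the verified reference, else 0.0."""
    return float(generated_answer.strip() == reference_answer.strip())

def code_reward(generated_code: str, test_code: str) -> float:
    """1.0 if the generated code passes the provided unit tests, else 0.0.
    Illustration only: no sandboxing or resource isolation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
        return float(result.returncode == 0)
    except subprocess.TimeoutExpired:
        return 0.0
```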

While RLHF primarily focuses on mitigating harmful outputs and RLVR enhances reasoning performance, RFT aims to adapt an OpenAI reasoning model (e.g., o4-mini) to expert-level performance in domain-specific applications [6]. Instead of relying on fixed “correct” answers as in SFT, RFT employs a programmable grader function that evaluates model outputs using user-defined feedback signals. Typical components of the grader include (1) string matching, (2) semantic similarity, (3) score-model grading, (4) label-model grading, and (5) Python-based code execution, which optionally utilizes the human-labeled responses. The model is trained to prioritize high-scoring outputs, aligning generation behavior with nuanced objectives such as style, safety, or domain accuracy. This paradigm is particularly valuable in low-resource or domain-shift settings where human-labeled data are scarce and traditional fine-tuning methods generalize poorly.
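A toy grader in the spirit of this list might combine, say, string matching with embedding-based semantic similarity; the encoder, weights, and function names below are hypothetical and do not represent OpenAI’s grader API.

```python
from sentence_transformers import SentenceTransformer, util

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def grade(response: str, reference: str) -> float:
    """Hypothetical grader mixing two of the components listed above:
    exact string matching and semantic similarity (the 0.3/0.7 weights are arbitrary)."""
    exact = float(response.strip().lower() == reference.strip().lower())
    semantic = util.cos_sim(_encoder.encode(response), _encoder.encode(reference)).item()
    return 0.3 * exact + 0.7 * semantic
```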

From previous studies, it can be observed that RL-based methods, which incorporate a balance between exploration and exploitation, substantially enhance the capabilities of LLMs. However, the SFT stage remains a purely teacher-forcing process, involving only exploitation without exploration. This naturally raises a question: can the SFT data itself be leveraged within an RL framework to further improve model performance? Motivated by this question, and inspired by the RFT paradigm, we propose RLSR, a method that fine-tunes the base model with reinforcement learning while fully utilizing the existing SFT dataset. The key component of RLSR lies in the design of its reward function, which is derived directly from the human-labeled SFT prompt–response pairs. In line with the principle of SFT, the reward reflects the semantic similarity between the model-generated response and the human-labeled response, under the assumption that semantically closer responses are of higher quality and more desirable. Following the embedding-based approaches adopted in DSSM [7], CLIP [8], and RAG [9], we define the reward as the cosine similarity between the embeddings of the generated response y_i and the human-labeled response y∗. Let E(·) denote the text encoder that extracts embeddings from responses.
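With this notation, the reward for the i-th sampled response can be written in the standard cosine-similarity form below (reconstructed from the definition above; the introduction additionally conditions the embedding on the prompt x, i.e., E(x, y_i) and E(x, y∗)):

$$ r_i \;=\; \cos\!\big(E(y_i),\, E(y^{*})\big) \;=\; \frac{E(y_i)\cdot E(y^{*})}{\lVert E(y_i)\rVert\,\lVert E(y^{*})\rVert} $$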

LLMs undergo extensive pretraining on vast datasets, enabling strong generative capabilities but often lacking precise alignment with human expectations or task-specific requirements [4]. Post-training techniques, such as SFT and RL-based methods, are critical for enhancing instruction-following, reasoning, and ethical alignment [38]. This section reviews key advancements in these areas, emphasizing their contributions to instruction-following and their relevance to our proposed RLSR method.

SFT is a cornerstone of LLM post-training, optimizing models on high-quality, human-labeled prompt-response pairs using a next-token prediction objective [4]. Studies, such as those on TULU [14] and INFINITY [15], demonstrate that SFT significantly improves instruction-following. However, its teacher-forcing paradigm limits exploration, potentially trapping models in local optima and constraining generalization [39]. This rigidity motivates exploration of RL-based alternatives that balance exploitation with exploratory learning.

RLHF addresses SFT’s limitations by optimizing models against a reward model trained on human preference data, typically using Proximal Policy Optimization (PPO) [1, 5]. RLHF enhances alignment with human values, as seen in models like GPT-4 and Claude [40, 41]. To reduce the cost of human feedback, Reinforcement Learning from AI Feedback (RLAIF) employs AI-generated feedback, offering scalability while maintaining alignment quality [42, 43]. For tasks requiring complex reasoning, such as mathematics and coding, RLVR uses precise, task-specific reward signals, improving performance on benchmarks such as MATH, HumanEval, and AIME [20, 21, 44]. GRPO, a simplified RL approach, further enhances efficiency by eliminating the need for a separate value model [2]. The integration of RLVR and GRPO serves as the cornerstone of the DeepSeek R1 model [3].

RFT focuses on adapting reasoning models to specialized domains with limited data by employing programmable grader functions that assess outputs through semantic similarity or code execution metrics [6]. It emphasizes adaptability and flexibility in low-resource settings, offering reward designs that may incorporate human-labeled responses. However, RFT does not explore improving base models for general instruction-following, nor does it leverage large-scale SFT data or embedding-based rewards to enhance alignment with human intent.

Recent research has shown that while SFT primarily promotes memorization, RL encourages generalization [39]. Embedding-based reward functions, as demonstrated in DSSM and CLIP, provide strong semantic similarity signals [7, 8]. Motivated by these findings and the previously identified gaps, RLSR extends SFT by integrating large-scale SFT data into an RL framework, using cosine similarity in embedding space to reward responses consistent with human-labeled data. In doing so, RLSR directly enhances the instruction-following ability of base models, achieving broader generalization and alignment without relying on domain-specific or sparse reward signals.