Bridging Offline and Online Reinforcement Learning for LLMs

Paper · arXiv 2506.21495 · Published June 26, 2025
Reinforcement Learning · Reward Models · Self Refinement · Self Consistency · Feedback

We investigate the effectiveness of reinforcement learning methods for fine-tuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following, with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Relative Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, all of which strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies needed to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.

For the optimization method itself, several candidates are commonly considered. When learning from preference labels, Direct Preference Optimization (DPO) (Rafailov et al., 2024) has emerged as a powerful algorithm and has become a popular choice for open-ended tasks due to the simplicity of its offline training (Xu et al., 2024a). It can be used with verifiable rewards (Pang et al., 2024) or with reward models (Xu et al., 2023b). DPO can also be used in a semi-online (iterative) fashion (Pang et al., 2024; Yuan et al., 2024). More recently, however, Group Relative Policy Optimization (GRPO) (Shao et al., 2024) has become widely used for fine-tuning LLMs in an online fashion, owing to its success in training thinking LLMs (Guo et al., 2025). GRPO is based on the popular RL algorithm PPO (Schulman et al., 2017a), which belongs to a class of online training methods that estimate the gradient of the reward signal.

While recent models have achieved impressive benchmark results, the relative importance of training approaches along the offline-to-online spectrum and their generalization across different tasks remain poorly understood. In this paper, we systematically explore the effectiveness of LLM post-training methods in different training setups by bridging the gap between offline and online methods. Specifically, we study offline, semi-online, and online configurations across both verifiable and non-verifiable tasks, as depicted in Figure 1. By examining the transition from offline to online training, i.e., by varying the frequency of periodic model syncing, we aim to understand how these methods can be tuned for improved performance and efficiency on any task. Our investigation focuses on two key aspects: the comparative effectiveness of semi-online and fully online training over offline training, and the relative performance of the DPO and GRPO objectives across verifiable and non-verifiable tasks.

LLM alignment, or post-training, is performed after the initial pre-training stage. The de facto task definition for LLM alignment is an instruction-following task, where the model input specifies an instruction and auxiliary task constraints, and a (typically human-written) response serves as the target. Due to its practical scalability, supervised fine-tuning (SFT) was initially the most common approach to post-training with high-quality instruction-following data (Touvron et al., 2023a,b; Zhou et al., 2023). Reinforcement Learning from Human Feedback (RLHF) was proposed before the rise of assistant-like LLMs (Ziegler et al., 2019), and it was only relatively recently that it was used to outperform SFT methods (Ouyang et al., 2022). This was made possible by instruction-following datasets being annotated with a set of responses and a human preference label for each response, allowing the training of reward models. Initial RLHF models were fine-tuned using Proximal Policy Optimization (PPO) (Schulman et al., 2017a). More recently, Direct Preference Optimization (Rafailov et al., 2023) and Group Relative Policy Optimization (Shao et al., 2024) have become the gold-standard fine-tuning methods for aligning language models. We detail these methods in the following subsections, as they provide the basis for our experiments.

While PPO learns from a single sample, which makes it generally applicable, GRPO leverages the fact that we can sample a group of responses $G = \{y_1, \ldots, y_N\}$ for any given prompt $x$.
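To make this concrete, below is a minimal sketch of the group-relative advantage computation that GRPO uses in place of PPO's learned value baseline; the function name and example rewards are ours, not from the paper.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize the rewards of N responses sampled for one prompt.

    GRPO replaces PPO's learned critic with group statistics: each
    response's advantage is its reward minus the group mean, divided
    by the group standard deviation (eps guards against zero variance).
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: N = 4 responses to one prompt x, scored by a verifiable 0/1 reward.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # approx. [ 1. -1.  1. -1.]
```

Because the baseline comes from the group itself, no separate value network needs to be trained.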

Unlike PPO or GRPO, which directly optimize the reward using noisy estimates based on a single sample, DPO optimizes the relation between two samples to match that of the optimal policy, which can be computed from the data without sampling noise. While this reduced training noise is an advantage, DPO lacks a theoretical guarantee that a decrease in its loss increases the expected reward. Another advantage of DPO, however, is that it does not depend on how the samples are generated, making it appealing for off-policy settings where responses are produced by another model.
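For reference, the DPO objective of Rafailov et al. (2023), for a prompt $x$ with preferred response $y_w$ and dispreferred response $y_l$, where $\pi_\theta$ is the trained policy, $\pi_{\text{ref}}$ the reference model, $\beta$ the KL-regularization strength, and $\sigma$ the logistic function:

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Note that the loss depends only on the likelihood ratios of the two responses, not on how they were sampled, which is precisely what enables off-policy training.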

2.3 Semi-online Optimization

As described above, GRPO is an on-policy algorithm that requires samples to be generated from the current policy, whereas DPO can learn from off-policy samples (Figure 1). Therefore, the GRPO training pipeline must be online, i.e., generation and model updates must be synchronous. DPO, on the other hand, was designed for a purely offline setup, where training responses are generated beforehand and the model is trained with the DPO loss on these pre-generated responses. However, it is also possible to perform multiple iterations of DPO, training on the entire dataset in each iteration and then generating a new set of responses with the model from the previous iteration. Iterative DPO often offers performance gains over offline DPO (Xu et al., 2023b; Yuan et al., 2024; Chen et al., 2024b).
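Schematically, iterative DPO can be sketched as follows; this is a sketch under our own naming, where generate_responses, build_preference_pairs, and train_dpo_on are hypothetical placeholders for the usual pipeline components, not an actual API.

```python
import copy

def iterative_dpo(model, prompts, num_iterations):
    """Hypothetical sketch of iterative DPO: regenerate, relabel, retrain."""
    for _ in range(num_iterations):
        frozen = copy.deepcopy(model)  # model from the previous iteration
        # Generate a fresh set of responses with the frozen model ...
        responses = generate_responses(frozen, prompts)            # placeholder
        # ... label them into preferred/dispreferred pairs ...
        pairs = build_preference_pairs(prompts, responses)         # placeholder
        # ... then train on the entire dataset with the DPO loss,
        # using the frozen snapshot as the reference policy.
        model = train_dpo_on(model, ref_model=frozen, data=pairs)  # placeholder
    return model
```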

In our work, we consider a semi-online DPO setup where the generation model's parameters are synchronized with the training model's parameters only periodically, but potentially much more often than in the iterative setting just described. Let s be the number of parameter update steps performed between synchronizations. Decreasing s makes training more online; at s = 1 it becomes purely online, with responses generated using the latest model parameters. In our experiments, we bridge the gap between offline and online training by controlling s and observing its effect on downstream performance. The advantage of increasing s lies in computational efficiency: responses can be generated in an embarrassingly parallel way.
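A minimal sketch of this loop follows, assuming PyTorch-style modules; sample_batch, generate_responses, build_preference_pairs, and dpo_step are hypothetical placeholders, not an actual API.

```python
import copy

def semi_online_dpo(model, ref_model, prompt_stream, s, total_steps):
    """Hypothetical sketch: sync the generator to the trainer every s steps.

    s = 1 recovers fully online training (responses always come from the
    latest parameters); large s approaches the iterative/offline regime.
    """
    generator = copy.deepcopy(model)  # frozen snapshot used for generation
    for step in range(total_steps):
        if step % s == 0:
            generator.load_state_dict(model.state_dict())  # periodic sync
        batch = sample_batch(prompt_stream)                       # placeholder
        responses = generate_responses(generator, batch)          # placeholder
        pairs = build_preference_pairs(batch, responses)          # placeholder
        dpo_step(model, ref_model, pairs)  # one parameter update (placeholder)
    return model
```

Because the generator is only a periodic snapshot, generation for the next s steps can run in parallel with training, which is the efficiency gain noted above.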