Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

Paper · arXiv 2508.08221 · Published August 11, 2025
Reinforcement Learning · RLVR

Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a significant surge in related studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for employing RL techniques and a fragmented understanding of their underlying mechanisms. Additionally, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and leaving practitioners unsure which to select. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments spanning datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups, and provide a reliable roadmap for practitioners navigating the RL-for-LLM-reasoning domain. Finally, we reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss. The results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.

Aligned with classic RL mechanism-analysis methodologies (Andrychowicz et al., 2020; Engstrom et al., 2020; Huang et al., 2024a), we systematically review the widely used RL techniques by reproducing them and independently evaluating the actual impact of each technique, based on the same open-source infrastructure and policy models. To comprehensively cover practical scenarios, we design extensive experimental settings incorporating datasets of varying difficulty levels, diverse model sizes, and distinct model types. Furthermore, we conduct an in-depth analysis of their theoretical foundations, implementation details, and applicable scenarios. Our contributions are illustrated in Figure 1. Specifically, ❶ our empirical results reveal that most RL techniques exhibit clear preferences for, and sensitivities to, the experimental setup, e.g., model type, data distribution, reward mechanism, and hyperparameters. ❷ Based on the isolated analysis under our setup, we demonstrate that employing only two techniques, i.e., advantage normalization (group-level mean, batch-level std) and token-level loss aggregation, can unlock the learning capability of critic-free policies using vanilla PPO loss, surpassing mainstream RL4LLM algorithms that incorporate redundant components.

  1. Group-level normalization shows robust efficiency under each reward setting, while batch-level normalization provides more stable improvement under a large-scale reward setting. (§4.1.1)

  2. Combining the group-level mean with the batch-level standard deviation yields a more robust normalization. (§4.1.3)

  3. Clip Higher tends to promote high-quality exploration for aligned models. (§4.2.1)

  4. There appears to be a “scaling law” between performance and the upper clipping bound on the small-sized model. (§4.2.3)

  5. Compared to sequence-level loss aggregation, token-level aggregation is effective on base models, while showing limited improvement on aligned models. (§4.3.1)

  6. Overlong filtering enhances accuracy and clarity for short-to-medium reasoning tasks but provides limited benefits for long-tail reasoning. (§4.4.1)

  7. Two techniques may unlock learning capacity in critic-free policies based on vanilla PPO loss; a sketch of this combination follows the list. (§5)
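
Below is a minimal sketch of that two-technique recipe as we read it, assuming per-token log-probabilities, one scalar reward per response, and a response mask are already available; the function and variable names are illustrative and not taken from the paper's released code.

```python
import torch

def minimalist_ppo_loss(logp_new, logp_old, rewards, group_ids, mask, eps=0.2):
    """Vanilla PPO clipped loss plus the two techniques highlighted above:
    (1) advantages built from group-level means and the batch-level std;
    (2) token-level loss aggregation over all valid tokens in the batch.
    Shapes (illustrative): logp_* and mask are [B, T]; rewards, group_ids are [B]."""
    # (1) Subtract each response's group-mean reward, divide by the batch std.
    baselines = torch.zeros_like(rewards)
    for g in group_ids.unique():
        idx = group_ids == g
        baselines[idx] = rewards[idx].mean()                # group-level mean
    adv = (rewards - baselines) / (rewards.std() + 1e-8)    # batch-level std
    adv = adv.unsqueeze(1).expand_as(logp_new)              # broadcast to tokens

    # Standard PPO clipped surrogate on per-token importance ratios.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # (2) Token-level aggregation: one average over every valid token in the batch.
    return -(surrogate * mask).sum() / mask.sum()
```

Everything else is left as vanilla PPO: no KL term, no dynamic sampling, and no decoupled clipping bounds.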

Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a widely used actor-critic algorithm grounded in the policy gradient framework. It improves the stability of policy learning by optimizing a clipped surrogate objective that restricts the divergence between the new and old policies during training.
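
For reference, the clipped surrogate objective of Schulman et al. (2017), where $r_t(\theta)$ is the probability ratio between the new and old policies and $\hat{A}_t$ is the estimated advantage:

$$
\mathcal{L}^{\mathrm{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;\mathrm{clip}\big(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
$$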

Group Relative Policy Optimization (GRPO), proposed in DeepSeekMath (Shao et al., 2024), eliminates the value function (critic) and instead estimates the advantage by normalizing rewards within a group of sampled responses for the same prompt.
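
Concretely, for a group of $G$ sampled responses to the same prompt with scalar rewards $\{R_j\}_{j=1}^{G}$, GRPO assigns response $o_i$ the group-normalized advantage

$$
\hat{A}_i \;=\; \frac{R_i - \mathrm{mean}\big(\{R_j\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{R_j\}_{j=1}^{G}\big)},
$$

which is shared by all of its tokens in place of a learned value baseline.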

Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) (Yu et al., 2025) is a recent RL method designed to address the unique challenges in LLM reasoning. For each question $q$ with gold answer $a$, DAPO samples a group of $G$ outputs $\{o_i\}_{i=1}^{G}$ from the old policy, computes their rewards, and maximizes the following surrogate objective:
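
The objective below is a reconstruction following the formulation in Yu et al. (2025): $\varepsilon_{\mathrm{low}}$ and $\varepsilon_{\mathrm{high}}$ are the decoupled clipping bounds, and the constraint implements dynamic sampling by discarding prompts whose sampled group is entirely correct or entirely incorrect.

$$
\mathcal{J}_{\mathrm{DAPO}}(\theta) \;=\;
\mathbb{E}_{(q,a)\sim\mathcal{D},\;\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}
\left[
\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}
\min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;
\mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\mathrm{low}},\,1+\varepsilon_{\mathrm{high}}\big)\,\hat{A}_{i,t}\Big)
\right]
$$

$$
\text{s.t.}\quad 0 \;<\; \big|\{\,o_i \mid \mathrm{is\_equivalent}(a, o_i)\,\}\big| \;<\; G,
$$

where $r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$ and $\hat{A}_{i,t} = \big(R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})\big) / \mathrm{std}(\{R_j\}_{j=1}^{G})$.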

A variety of practical techniques have been introduced to stabilize optimization, reduce variance, and accelerate the convergence of LLMs on reasoning tasks. Drawing from prior research and practical implementations, we categorize widely used techniques as follows.

Baseline Design. Baselines are crucial for reducing variance in policy gradient estimation. Recent studies have proposed more effective formulations, such as using the mean reward within each group as the baseline (Shao et al., 2024) and computing each sample's baseline from the other samples in its group, i.e., a leave-one-out estimate (Ahmadian et al., 2024; Kool et al., 2019).
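
As an illustration of the leave-one-out variant (the function name is ours, not from the cited papers), each sample's baseline is simply the mean reward of the other $G-1$ responses in its group:

```python
import torch

def leave_one_out_baseline(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: [G] rewards of one group of responses to the same prompt.
    Each sample's baseline is the mean reward of the other G - 1 samples."""
    G = rewards.numel()
    return (rewards.sum() - rewards) / (G - 1)

# Usage: advantages = rewards - leave_one_out_baseline(rewards)
```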

Clipping Strategies. Clipping controls excessive updates in policy optimization and can be applied to different quantities, such as rewards, advantages, or probability ratios. For example, Clip Ratio Higher (Yu et al., 2025) relaxes the upper bound in PPO’s ratio clipping to better preserve exploration.
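
A hedged sketch of that asymmetric clipping; the bound values shown are illustrative rather than recommended settings:

```python
import torch

def clip_higher_surrogate(ratio, adv, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with decoupled bounds: the upper bound (1 + eps_high)
    is relaxed relative to the lower bound (1 - eps_low), leaving more room for
    low-probability tokens to be reinforced before the ratio is clipped."""
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
    return -torch.min(ratio * adv, clipped * adv)
```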

Normalization Strategies. Normalization of rewards or advantages helps stabilize gradient magnitudes. Representative approaches include: Batch-level Reward Normalization (Hu et al., 2025), Group-level Reward Normalization (Shao et al., 2024; Ahmadian et al., 2024), and Reward Shift without Standard Deviation (Liu et al., 2025a), which removes the standard deviation term to avoid the difficulty bias.
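
The three variants can be summarized in a few lines (a sketch; the tensor shapes and mode names are our own):

```python
import torch

def normalize_rewards(rewards, group_ids, mode="group"):
    """rewards: [B] scalar rewards; group_ids: [B] prompt index of each response.
    'batch' - standardize over the whole batch;
    'group' - standardize within each prompt's group;
    'shift' - subtract the group mean only, dropping the std term
              to avoid the difficulty bias."""
    if mode == "batch":
        return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    out = torch.empty_like(rewards)
    for g in group_ids.unique():
        idx = group_ids == g
        centered = rewards[idx] - rewards[idx].mean()
        if mode == "group":
            centered = centered / (rewards[idx].std() + 1e-8)
        out[idx] = centered
    return out
```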

Filtering Strategies. Filtering removes uninformative or undesirable samples prior to gradient computation. Examples include: Overlong Filtering (Yu et al., 2025), which removes responses exceeding predefined length limits; Error Max Clip Mask and Right Min Clip Mask, which filter overly incorrect or trivially correct samples; and Difficulty Mask (Yu et al., 2025; Zhang et al., 2025; Chu et al., 2025), which excludes samples outside a targeted difficulty range.
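
A minimal sketch of the overlong filter and the difficulty mask; the length limit and the use of group accuracy as the difficulty signal are illustrative assumptions:

```python
import torch

def build_sample_mask(resp_lens, group_acc, max_len=4096):
    """resp_lens: [B] response lengths in tokens; group_acc: [B] mean accuracy of
    each response's group. Returns a 0/1 mask that drops samples from the loss."""
    keep = resp_lens <= max_len                    # overlong filtering
    keep &= (group_acc > 0.0) & (group_acc < 1.0)  # difficulty mask: skip prompts
    # whose sampled group is entirely wrong or entirely right.
    return keep.float()
```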

Loss Aggregation Granularity. The formulation of loss aggregation determines the relative weight each token contributes to the overall objective. Token-level Loss weights every token in the batch equally, which reduces length bias, while Sequence-level Loss first averages within each sequence before aggregating across sequences.
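
The contrast between the two granularities, sketched under the assumption that per_token_loss and mask are [B, T] tensors:

```python
def aggregate_loss(per_token_loss, mask, level="token"):
    """Token level: every valid token in the batch receives equal weight.
    Sequence level: average within each response first, then across responses,
    so tokens of long responses are individually down-weighted."""
    if level == "token":
        return (per_token_loss * mask).sum() / mask.sum()
    per_seq = (per_token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_seq.mean()
```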

Additional Loss Functions. Auxiliary losses can complement the primary objective and regularize training. KL Loss (Yu et al., 2025; Liu et al., 2025a) constrains divergence from a reference policy, while SFT Loss (Zhang and Zuo, 2025) incorporates supervised fine-tuning objectives to preserve alignment.
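
For the KL term, a common per-token choice is the unbiased, non-negative estimator popularized by GRPO, added to the policy loss with a small coefficient $\beta$; whether every surveyed implementation uses exactly this form is not claimed here:

$$
\widehat{\mathbb{D}}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]
\;=\;
\frac{\pi_{\mathrm{ref}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})}
\;-\;
\log\frac{\pi_{\mathrm{ref}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})}
\;-\; 1.
$$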

Reward Design. Shaping the reward function can guide desired output properties. Common examples include: a Length Penalty, which discourages excessively long outputs; a Formatting Reward, which encourages outputs that adhere to preferred structures such as boxed answers, bullet lists, or code-style formatting; and a Length-Dependent Accuracy Reward, which combines correctness with output length.
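
An illustrative composite reward along these lines; the weights, length budget, and format check are assumptions made for the sketch rather than settings from the paper:

```python
def shaped_reward(is_correct: bool, resp_len: int, has_boxed_answer: bool,
                  max_len: int = 4096, len_penalty: float = 0.5,
                  format_bonus: float = 0.1) -> float:
    """Correctness plus a small formatting bonus, with a penalty that grows with
    how far the response overshoots the length budget."""
    reward = 1.0 if is_correct else 0.0
    if has_boxed_answer:
        reward += format_bonus
    if resp_len > max_len:
        reward -= len_penalty * min(1.0, (resp_len - max_len) / max_len)
    return reward
```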

The above categories summarize the most prevalent improvement strategies for RL in LLM reasoning. In this work, we focus on four key aspects: Normalization, Clipping, Masking, and Loss Aggregation, and conduct in-depth analyses of their mechanisms and practical utility.