Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

Paper · arXiv 2508.07629 · Published August 11, 2025
Reinforcement Learning · Reasoning Methods · CoT · ToT

We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving and achieves outstanding performance across multiple benchmarks. Although the community has already produced many excellent works on reasoning models, reproducing high-performance reasoning models remains difficult because training details are often disclosed incompletely. This report provides an in-depth analysis of our reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources is more effective than a large number of diverse ones, and that difficult samples can achieve better results without accuracy filtering. In addition, we identify two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO), which gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples.

We discover that for long Chain-of-Thought supervised fine-tuning (long CoT SFT), a compact set of high-quality data sources proves significantly more effective than larger, more diverse datasets, as high-quality examples ensure consistent learning of accurate reasoning patterns. We further find that easy SFT samples left unfiltered for correctness can easily interfere with the model, whereas difficult samples do not appear to harm performance even without filtering. In fact, errors on difficult samples may even promote the model's exploration and benefit its performance.

For reinforcement learning, clipping the importance sampling ratio is a commonly used technique: the clipping mechanism limits the magnitude of policy updates to keep training stable. Through in-depth analysis, we identify two issues with the current clipping mechanism (Schulman et al., 2017), whose standard form is recalled after the list below:

• High-entropy token clipping. Among the tokens whose importance ratio is clipped above the upper threshold 1 + ϵ, there are high-entropy tokens that often correspond to valuable exploratory behavior at critical decision points. Directly clipping these tokens can terminate exploration prematurely, adversely affecting the model's post-convergence performance. Although DAPO (Yu et al., 2025) proposes Clip-Higher, which raises the upper threshold to 1 + ϵ_h to mitigate this issue, high-entropy tokens exceeding even this higher threshold still face the same problem.

• Delayed convergence on negative samples. When the importance sampling ratio of a token from a suboptimal trajectory falls below 1 − ϵ, its gradient is forcibly truncated, preventing the model from updating on these negative signals and thereby slowing convergence.
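For reference, both issues trace back to the standard clipped surrogate objective of PPO (Schulman et al., 2017). Writing the importance ratio as $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and the advantage estimate as $\hat{A}_t$, the objective is

$$
\mathcal{L}^{\text{clip}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right].
$$

Whenever the clipped branch is the active one (e.g., $r_t > 1+\epsilon$ with $\hat{A}_t > 0$, or $r_t < 1-\epsilon$ with $\hat{A}_t < 0$), the surrogate is constant in $\theta$, so the token contributes zero gradient. Clip-Higher only moves the upper threshold to $1+\epsilon_h$; the zero-gradient behavior beyond the thresholds is unchanged.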

To address these two issues, we propose Gradient-Preserving clipping Policy Optimization (GPPO), which does not discard the gradients of any token. Even clipped tokens remain in the backpropagation computational graph and participate in gradient computation, and the gradients GPPO propagates back from them can be proven to be bounded and mild. This mechanism strikes a balance between maintaining training stability and preserving valuable gradient information. Our experimental results demonstrate that, compared to Clip-Higher, GPPO achieves superior and more stable performance.
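To make the mechanism concrete, below is a minimal PyTorch sketch of one way to realize gradient-preserving clipping via a stop-gradient rescaling. The function name `gppo_loss`, the default hyperparameters, and the exact construction are illustrative assumptions based on the properties stated above (clipped forward value, bounded non-zero backward signal), not the paper's reference implementation.

```python
import torch

def gppo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """Illustrative gradient-preserving clipped surrogate (token-level)."""
    # Importance sampling ratio r_t = pi_theta(a_t|s_t) / pi_old(a_t|s_t).
    ratio = torch.exp(logp_new - logp_old)
    hard_clip = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)

    # Gradient-preserving clip: the forward value equals the hard-clipped
    # ratio, but the backward pass still flows through `ratio`, scaled by
    # the detached factor (hard_clip / ratio). Because
    # d(ratio)/d(theta) = ratio * d(logp_new)/d(theta), a clipped token's
    # gradient becomes hard_clip * d(logp_new)/d(theta): bounded by the
    # clip thresholds ("mild") instead of zeroed out as in standard PPO.
    soft_clip = (hard_clip / ratio).detach() * ratio

    # Same pessimistic min as PPO, but clipped tokens keep a gradient path.
    per_token = -torch.min(ratio * advantages, soft_clip * advantages)
    return per_token.mean()
```

On unclipped tokens `soft_clip` coincides with `ratio`, so the sketch reduces to the vanilla clipped surrogate there; only tokens outside $[1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}]$ behave differently, receiving a bounded gradient rather than none.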