rStar2-Agent: Agentic Reasoning Technical Report

Paper · arXiv 2508.20722 · Published August 28, 2025
Reward Models · Reinforcement Learning · RLVR

We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-stage RL, yielding advanced cognitive abilities at minimal compute cost. As a result, rStar2-Agent boosts a pre-trained 14B model to the state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks.

To move beyond merely “thinking longer”, we aim to enable models to “think smarter” by developing more advanced cognitive abilities that autonomously use the right tools to reason, validate, and learn from the feedback signals provided by the tool environment. We incentivize these abilities through agentic reinforcement learning, where the model interacts with tools inside a dedicated tool environment and adapts its reasoning based on the feedback it receives. Crucially, not all tools or environments are equally effective; a valuable environment must be deployable and must provide accurate, verifiable signals that guide the model toward stronger reasoning paths. In this work, we focus on Python coding tools, with the Python interpreter as the environment for agentic reinforcement learning. Python coding tools broaden the model’s action space, enabling exploration of alternative solutions and verification of intermediate steps, thereby complementing internal self-reflection when long CoT alone is insufficient.
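To make this interaction pattern concrete, the sketch below shows one way an agentic rollout against a Python interpreter could be structured: the model alternates between emitting reasoning text, issuing a code block for execution, and conditioning on the interpreter's output. All interfaces here (model.generate, run_python, the <code>/<output> tags) are illustrative assumptions, not the actual rStar2-Agent implementation.

```python
# Minimal sketch of an agentic rollout with a Python tool environment.
# All interfaces (model.generate, run_python, tag names) are hypothetical.

def agentic_rollout(model, prompt, max_turns=8):
    """Alternate between model generation and Python tool execution."""
    trajectory = prompt
    for _ in range(max_turns):
        # The model reasons in text and may end its turn with a code block.
        segment = model.generate(trajectory, stop=["</code>"])
        trajectory += segment
        if "<code>" not in segment:
            break  # no tool call: the model has produced its final answer
        code = segment.split("<code>", 1)[1]
        # Execute the code and feed the result (including error messages)
        # back to the model so it can verify or refine intermediate steps.
        result = run_python(code, timeout_s=10)
        trajectory += f"</code>\n<output>\n{result}\n</output>\n"
    return trajectory
```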

However, effectively scaling agentic reinforcement learning poses significant challenges. First, the inherent complexity of coding tools and the Python interpreter introduces environment noise into the reasoning process. When the model inevitably generates syntactically or logically incorrect code, the resulting environment feedback (e.g., error messages) can cause it to waste valuable tokens correcting mistakes rather than advancing its reasoning. Unfortunately, current RL methods [Shao et al., 2024, Guo et al., 2025], which rely primarily on outcome-only rewards, exacerbate this issue because trajectories with failed intermediate tool calls still receive a positive reward if the final answer is correct. As a result, the model treats errors as acceptable and produces lengthy, low-quality reasoning trajectories. Second, large-scale agentic RL training imposes substantial infrastructure demands. A single training batch can trigger tens of thousands of concurrent tool calls, making it challenging to construct a reliable and responsive code execution environment. Moreover, agentic rollouts with environment interactions amplify the rollout inefficiencies of standard RL systems, significantly slowing the overall training process.
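The infrastructure challenge can be illustrated with a small sketch of an isolated, timeout-guarded code executor. This is not the infrastructure the paper builds (innovation (i) above); it only indicates the kind of isolation and concurrency such an environment must handle when a single batch triggers tens of thousands of tool calls. Function names and limits are assumptions.

```python
# Illustrative sketch of an isolated, timeout-guarded code execution service.
# Not the paper's infrastructure; function names and limits are assumptions.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_python(code: str, timeout_s: float = 10.0) -> str:
    """Run untrusted, model-generated code in a separate process."""
    try:
        proc = subprocess.run(
            ["python", "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.stdout if proc.returncode == 0 else proc.stderr
    except subprocess.TimeoutExpired:
        return "TimeoutError: execution exceeded the per-call limit"

def run_batch(snippets, max_workers=64):
    """Execute many tool calls concurrently; each failure stays isolated."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_python, snippets))
```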

To enable effective agentic reinforcement learning in a code environment, we propose Group Relative Policy Optimization with Resample-on-Correct (GRPO-RoC), which integrates GRPO with a Resample-on-Correct (RoC) rollout strategy to address environment-induced noise under sparse, outcome-only rewards. Specifically, RoC first oversamples a larger group of rollouts and then down-samples to the standard batch size. Positive trajectories are filtered to retain only the highest-quality ones, with minimal tool-induced errors or formatting issues, while negative trajectories are uniformly down-sampled. This simple yet effective asymmetric sampling preserves diverse failure modes as informative negative signals while emphasizing higher-quality success cases for positive supervision. Compared to methods that explicitly penalize tool-use errors in the reward function [Qian et al., 2025, Li et al., 2025, Kimi], GRPO-RoC improves training stability and avoids reward-hacking risks. By learning from cleaner, higher-quality positive trajectories, the model not only improves its use of Python coding tools but also exhibits advanced cognitive abilities, reasoning more effectively and concisely under realistic code-environment interactions.

Finally, we present a training recipe that boosts a 14B pre-trained base model to frontier-level math reasoning with minimal compute. Unlike prior works that apply reasoning-heavy SFT before RL [Liu et al., 2025a, Feng et al., 2025, Team, 2025, Seed et al., 2025], we begin with a non-reasoning SFT stage solely to instill general instruction-following, coding tool usage, and formatting, without enhancing reasoning. This avoids potential SFT overfitting and keeps initial average responses short, allowing RL to more effectively cultivate reasoning while fully exploiting the model’s pre-trained capability. We then conduct multi-stage RL training with GRPO-RoC, gradually increasing task difficulty and maximum training length. Unlike prior RL methods that scale maximum rollout lengths to 16K→48K or more [Chen et al., 2025, Xiaomi et al., 2025], we limit each stage to shorter lengths (8K→12K). This significantly reduces RL costs while encouraging more efficient reasoning strategies. With only 510 RL steps, the model rapidly achieves frontier-level math reasoning, demonstrating both high capability and exceptional training efficiency.
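The Resample-on-Correct step admits a compact sketch. Below, an oversampled rollout group is reduced to the target group size by keeping the cleanest correct trajectories and uniformly down-sampling the incorrect ones; the field names and the positive/negative split ratio are illustrative assumptions, not the exact settings used in GRPO-RoC.

```python
import random

def resample_on_correct(rollouts, group_size):
    """Down-sample an oversampled rollout group to `group_size` (RoC sketch).

    Positive (correct-answer) rollouts are filtered toward the cleanest ones,
    i.e. fewest tool-induced errors and formatting issues, while negative
    rollouts are uniformly down-sampled to preserve diverse failure modes.
    Field names (is_correct, tool_errors, format_issues) are illustrative.
    """
    positives = [r for r in rollouts if r["is_correct"]]
    negatives = [r for r in rollouts if not r["is_correct"]]

    # Illustrative split: aim for roughly half positives, fill with negatives.
    n_pos = min(len(positives), group_size // 2)
    n_neg = min(len(negatives), group_size - n_pos)

    # Positives: keep only the trajectories with minimal environment noise.
    positives.sort(key=lambda r: (r["tool_errors"], r["format_issues"]))
    kept_pos = positives[:n_pos]

    # Negatives: uniform sampling, so failure modes stay diverse.
    kept_neg = random.sample(negatives, n_neg)

    return kept_pos + kept_neg
```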