Why do correct code trajectories teach models to tolerate errors?
Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.
When language models learn to use coding tools during RL training, the code environment introduces a specific form of noise that standard outcome-based RL cannot handle. The model inevitably generates syntactically or logically incorrect code during reasoning, producing error messages and wasting tokens on correction. Under standard GRPO (which uses only outcome rewards), trajectories with failed intermediate tool calls still receive positive reward if the final answer is correct. The model learns that code errors are acceptable — producing lengthy, low-quality reasoning trajectories with unnecessary error-correction loops.
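To make the failure mode concrete, here is a minimal sketch (illustrative only, not from the paper; the rollout group and reward values are hypothetical) of how an outcome-only group-relative advantage treats a clean correct rollout and an error-laden correct rollout identically:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage: (reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]

# Hypothetical group of four rollouts for one prompt: (is_correct, num_tool_errors)
group = [(True, 0), (True, 3), (False, 1), (False, 0)]
outcome_rewards = [1.0 if is_correct else 0.0 for is_correct, _ in group]

print(grpo_advantages(outcome_rewards))
# -> [1.0, 1.0, -1.0, -1.0]: the clean success (0 tool errors) and the messy
# success (3 tool errors) receive the same positive advantage, so both
# reasoning styles are reinforced equally.
```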
rStar2-Agent (2025) proposes GRPO-RoC (Resampling on Correct), which applies asymmetric filtering:
- Oversample — generate a larger group of rollouts per prompt than the standard GRPO group size
- Filter positive trajectories — from correct-answer rollouts, retain only those with minimal tool-induced errors or formatting issues (the cleanest successes)
- Downsample negative trajectories uniformly — preserve diverse failure modes as informative negative signal
The asymmetry is deliberate. Positive trajectories need quality filtering because the model should learn from clean reasoning, not from "stumbled to the right answer despite multiple code crashes." Negative trajectories need diversity preservation because understanding many ways to fail is more informative than understanding one failure mode well.
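A rough sketch of this asymmetric filtering step is below. The `Rollout` fields, the quality ranking, the half-and-half split, and the 2x oversampling factor are assumptions for illustration, not the paper's implementation:

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    correct: bool          # final answer matches the reference
    tool_error_count: int  # failed intermediate code executions
    format_penalty: float  # e.g. malformed tool-call or answer tags

def grpo_roc_filter(rollouts, group_size, rng=random):
    """Filter an oversampled rollout set back down to group_size, asymmetrically."""
    positives = [r for r in rollouts if r.correct]
    negatives = [r for r in rollouts if not r.correct]

    # Positive side: quality filtering. Rank correct rollouts by tool/format noise
    # and keep only the cleanest ones (here: up to half the group).
    positives.sort(key=lambda r: (r.tool_error_count, r.format_penalty))
    kept_pos = positives[: group_size // 2]

    # Negative side: diversity preservation. Uniform downsampling keeps many
    # distinct failure modes rather than privileging any particular kind of failure.
    n_neg = min(len(negatives), group_size - len(kept_pos))
    kept_neg = rng.sample(negatives, n_neg)

    return kept_pos + kept_neg

# Usage: oversample rollouts for a prompt (e.g. 2x the group size), then filter.
oversampled = [Rollout(correct=random.random() < 0.4,
                       tool_error_count=random.randint(0, 4),
                       format_penalty=random.random())
               for _ in range(32)]
training_group = grpo_roc_filter(oversampled, group_size=16)
```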
This connects to Does step-level confidence outperform global averaging for trace filtering? — both approaches recognize that not all correct trajectories are equally valuable for learning. It also extends Does RL training follow a predictable two-phase learning sequence? — tool use is a procedural capability that must consolidate (clean tool usage) before strategic reasoning can effectively build on it.
The results are striking: a 14B model reaches frontier-level math reasoning in only 510 RL steps within one week (64 MI300X GPUs), achieving 80.6% on AIME24 and 69.8% on AIME25 — surpassing DeepSeek-R1 (671B) with significantly shorter responses. The training recipe starts with non-reasoning SFT (instruction following + code tool usage + formatting only, no reasoning enhancement) to avoid SFT overfitting, then applies multi-stage RL with increasing task difficulty and maximum response length.
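As a hedged sketch of what such a staged recipe might look like as configuration (the stage boundaries, token limits, and data mix below are placeholder assumptions, not the paper's reported hyperparameters):

```python
# Illustrative staged recipe; all concrete values are placeholders.
training_recipe = {
    "non_reasoning_sft": {
        # Instruction following + code-tool usage + formatting only,
        # no chain-of-thought data, to avoid SFT overfitting before RL.
        "data_mix": ["instruction_following", "code_tool_calls", "formatting"],
    },
    "rl_stages": [
        # Multi-stage GRPO-RoC with growing response budget and difficulty.
        {"stage": 1, "max_response_tokens": 8_000,  "difficulty": "moderate"},
        {"stage": 2, "max_response_tokens": 12_000, "difficulty": "moderate"},
        {"stage": 3, "max_response_tokens": 12_000, "difficulty": "hard"},
    ],
}
```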
Source: Reward Models — rStar2-Agent: Agentic Reasoning Technical Report (arXiv:2508.20722)
Related concepts in this collection
- Does step-level confidence outperform global averaging for trace filtering?
  Explores whether measuring confidence at individual reasoning steps—rather than averaging across entire traces—better identifies and filters out low-quality reasoning. Matters because it could dramatically improve both accuracy and compute efficiency in multi-trace reasoning.
  Relation: related quality-filtering principle applied at step level
- Does RL training follow a predictable two-phase learning sequence?
  Explores whether reinforcement learning exhibits consistent phases in which basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
  Relation: tool use as a procedural capability that must consolidate before strategic reasoning
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
  Relation: GRPO-RoC's filtered positive trajectories are cleaner and shorter, consistent with this finding
- Why does SFT-then-RL training follow a predictable three-phase pattern?
  When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.
  Relation: rStar2's non-reasoning SFT avoids the overfitting phase by not injecting reasoning patterns
- Can reinforcement learning scale beyond single-turn language tasks?
  Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.
  Relation: complementary agentic RL approach — rStar2 solves trajectory quality in code-tool environments through asymmetric filtering, while SWE-RL solves long-horizon credit assignment in multi-turn code tasks; together they address the two key challenges (noisy intermediate steps and sparse delayed rewards) that make agentic code RL harder than single-turn reasoning RL
Original note title: agentic rl with code tools requires asymmetric trajectory filtering because environment noise in correct trajectories teaches the model to tolerate errors