Can vanilla PPO match specialized reasoning algorithms with just two techniques?
Does a minimalist combination of advantage normalization and token-level loss aggregation enable critic-free PPO to compete with more complex algorithms like GRPO and DAPO in language model reasoning tasks?
The RL-for-LLM-reasoning field has produced a zoo of algorithms (GRPO, DAPO, GPPO, GFPO), each adding techniques atop PPO: clip-higher, dynamic sampling, overlong filtering, difficulty masking, KL loss, SFT loss. But which techniques actually matter? Evaluating each technique in isolation within a unified framework reveals that most RL techniques show strong preferences for, and sensitivity to, the experimental setup: model type, data distribution, reward mechanism, and hyperparameters.
The key finding: employing only two techniques — advantage normalization (group-level mean, batch-level standard deviation) and token-level loss aggregation — unlocks the learning capability of critic-free policies using vanilla PPO loss. This minimalist combination consistently improves performance, surpassing strategies like GRPO and DAPO that incorporate many additional components.
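A minimal sketch of that normalization, assuming the usual critic-free setup in which each prompt is sampled group_size times and each sampled response receives one scalar reward; the function name and the epsilon handling are illustrative, not taken from the source.

```python
import torch

def hybrid_normalize_advantages(rewards: torch.Tensor,
                                group_size: int,
                                eps: float = 1e-6) -> torch.Tensor:
    """Critic-free advantage estimate: group-level mean, batch-level std.

    rewards: 1-D tensor of length num_prompts * group_size, one scalar reward
             per sampled response (responses for the same prompt are contiguous).
    Returns one advantage per response; in PPO it is broadcast to every token
    of that response.
    """
    grouped = rewards.view(-1, group_size)            # (num_prompts, group_size)
    group_mean = grouped.mean(dim=1, keepdim=True)    # group-level mean per prompt
    centered = (grouped - group_mean).reshape(-1)     # center within each group
    batch_std = rewards.std()                         # one shared batch-level scale
    return centered / (batch_std + eps)
```

One way to read the hybrid: the group mean keeps the baseline local to each prompt's difficulty, while the batch std avoids dividing by a tiny within-group spread when all responses to a prompt score similarly.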
Specific technique findings: (1) group-level normalization shows robust efficiency across reward settings; (2) batch-level normalization provides more stable improvement at larger reward scales; (3) combining group-level mean with batch-level std enables robust normalization; (4) token-level aggregation is effective on base models but shows limited improvement on already-aligned models; (5) overlong filtering helps short-to-medium reasoning but not long-tail reasoning.
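To make finding (4) concrete, here is a sketch contrasting the two aggregation schemes, assuming per-token PPO losses have already been computed; the tensor shapes and the helper's name are assumptions, not from the source.

```python
import torch

def aggregate_policy_loss(per_token_loss: torch.Tensor,
                          response_mask: torch.Tensor,
                          token_level: bool = True) -> torch.Tensor:
    """Aggregate a (batch, seq_len) tensor of per-token PPO losses.

    token_level=True : every valid response token in the batch carries equal
                       weight, so long responses contribute more loss terms.
    token_level=False: sequence-level (GRPO-style) aggregation; average within
                       each response first, so every response carries equal weight.
    response_mask is a 0/1 float tensor marking non-padding response tokens.
    """
    if token_level:
        return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1.0)
    per_seq = (per_token_loss * response_mask).sum(dim=1) / response_mask.sum(dim=1).clamp(min=1.0)
    return per_seq.mean()
```

The only difference between the two branches is whether long responses are up-weighted, which is plausibly the knob behind the base-model versus aligned-model gap in finding (4).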
This strongly reinforces the insight from the related note "Does the choice of RL algorithm actually matter for reasoning?". If vanilla PPO with two techniques matches or surpasses GRPO and DAPO, then the algorithmic innovation in the current RL-for-reasoning literature is largely engineering optimization, not fundamental capability improvement. The pretrained prior determines what's achievable; the algorithm determines how efficiently you get there, with diminishing returns from complexity.
Source: Reinforcement Learning
Related concepts in this collection
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and Return-Conditioned RL show similar performance on reasoning tasks. The question is whether algorithm differences are fundamentally irrelevant, or whether something deeper explains the convergence.
  directly supports: even simpler than expected — two techniques suffice
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  connects: advantage normalization and token-level loss may work precisely because they manage entropy dynamics
- Does RL training collapse format diversity in pretrained models?
  Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
  extends: the format convergence may be inevitable regardless of algorithm, which is why algorithm choice doesn't matter much
- Can RL training run while generation continues without waiting?
  Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?
  complementary PPO simplification: AReaL modifies PPO for staleness tolerance in asynchronous training; this note shows PPO needs only two techniques for reasoning performance — together they suggest the PPO framework is more robust and adaptable than the proliferation of replacement algorithms implies
Original note title: two techniques unlock critic-free ppo matching grpo and dapo — advantage normalization and token-level loss aggregation