Reinforcement Learning for LLMs

Can vanilla PPO match specialized reasoning algorithms with just two techniques?

Does a minimalist combination of advantage normalization and token-level loss aggregation enable critic-free PPO to compete with more complex algorithms like GRPO and DAPO in language model reasoning tasks?

Note · 2026-02-22 · sourced from Reinforcement Learning

The RL-for-LLM-reasoning field has produced a zoo of algorithms (GRPO, DAPO, GPPO, GFPO), each adding techniques atop PPO: clip-higher, dynamic sampling, overlong filtering, difficulty masking, KL loss, SFT loss. But which techniques actually matter? Evaluating each technique in isolation within a unified framework reveals that most are highly sensitive to the experimental setup: model type, data distribution, reward mechanism, and hyperparameters.
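
For orientation, here is a minimal sketch of the per-token clipped PPO surrogate that all of these methods build on, including the asymmetric "clip-higher" variant. The function name, tensor shapes, and epsilon values are illustrative assumptions, not taken from any particular implementation.

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Per-token clipped policy-gradient loss (sketch).

    logp_new, logp_old: log-probs of the sampled tokens, shape [batch, seq_len]
    advantages:         per-token advantages, same shape
    eps_low / eps_high: symmetric PPO uses eps_low == eps_high; the
                        clip-higher trick raises eps_high so low-probability
                        tokens can grow faster.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # Negative because optimizers minimize; aggregation over tokens happens later.
    return -torch.minimum(unclipped, clipped)
```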

The key finding: employing only two techniques — advantage normalization (group-level mean, batch-level standard deviation) and token-level loss aggregation — unlocks the learning capability of critic-free policies using vanilla PPO loss. This minimalist combination consistently improves performance, surpassing strategies like GRPO and DAPO that incorporate many additional components.
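
A minimal sketch of what this two-technique recipe could look like, assuming a critic-free setup in which each prompt has a group of sampled responses and one scalar reward per response; function and variable names are illustrative, not from any specific codebase.

```python
import torch

def mixed_normalized_advantages(rewards):
    """Advantage normalization: group-level mean, batch-level std (sketch).

    rewards: [num_prompts, group_size], one scalar reward per sampled response.
    Returns per-response advantages, later broadcast to every token of the
    corresponding response.
    """
    group_mean = rewards.mean(dim=1, keepdim=True)   # center within each prompt's group
    batch_std = rewards.std().clamp_min(1e-6)         # single std over the whole batch
    return (rewards - group_mean) / batch_std

def token_level_loss(per_token_loss, response_mask):
    """Token-level loss aggregation (sketch).

    per_token_loss: [batch, seq_len], e.g. the clipped surrogate above
    response_mask:  [batch, seq_len], 1 on response tokens, 0 elsewhere
    Averages over all response tokens in the batch, so long and short
    responses contribute in proportion to their token counts.
    """
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp_min(1.0)
```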

Specific technique findings: (1) group-level normalization shows robust efficiency across reward settings; (2) batch-level normalization provides more stable improvement at larger reward scales; (3) combining group-level mean with batch-level std enables robust normalization; (4) token-level aggregation is effective on base models but shows limited improvement on already-aligned models; (5) overlong filtering helps short-to-medium reasoning but not long-tail reasoning.
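
For comparison, sketches of the alternatives these findings weigh against each other: pure group-level and pure batch-level normalization, and sequence-level loss aggregation. Shapes follow the previous block; names are assumptions for illustration only.

```python
import torch

def group_level_norm(rewards, eps=1e-6):
    # Finding (1): mean and std both computed within each prompt's group.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(eps)
    return (rewards - mean) / std

def batch_level_norm(rewards, eps=1e-6):
    # Finding (2): mean and std both computed over the whole batch.
    return (rewards - rewards.mean()) / rewards.std().clamp_min(eps)

def sequence_level_loss(per_token_loss, response_mask):
    # Baseline that token-level aggregation (finding 4) is compared against:
    # average within each response first, then across responses, so every
    # response counts equally regardless of its length.
    per_seq = (per_token_loss * response_mask).sum(dim=1) / response_mask.sum(dim=1).clamp_min(1.0)
    return per_seq.mean()
```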

This strongly reinforces the existing insight from "Does the choice of RL algorithm actually matter for reasoning?". If vanilla PPO with two techniques matches or surpasses GRPO and DAPO, then the algorithmic innovation in the current RL-for-reasoning literature is largely engineering optimization, not fundamental capability improvement. The pretrained prior determines what's achievable; the algorithm determines how efficiently you get there, with diminishing returns from complexity.


Source: Reinforcement Learning

Original note title: two techniques unlock critic-free ppo matching grpo and dapo — advantage normalization and token-level loss aggregation