Can PPO match GRPO and DAPO with just two techniques?
This explores whether plain PPO can be made competitive with fancier RL algorithms (GRPO, DAPO) using a couple of targeted fixes — and what that says about where the real gains in RL-for-reasoning actually come from.
The short answer the corpus gives is yes. The headline result is that two techniques, advantage normalization and token-level loss aggregation, let a critic-free version of vanilla PPO not just match but in places surpass the more elaborate algorithms it's usually compared against (see "Can two simple techniques match complex RL algorithms?"). The deeper takeaway buried in that finding is more interesting than the bake-off itself: most RL techniques turn out to be setup-sensitive, and what actually sets the performance ceiling is the pretrained prior, not the choice of optimizer.
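To make the two techniques concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the evaluated implementation: the function names are hypothetical, and the choice to center on each prompt's group mean while scaling by the batch-wide standard deviation is an assumption of this sketch, since the text above does not specify the normalization scope.

```python
from statistics import mean, pstdev


def normalize_advantages(group_rewards, eps=1e-8):
    """Critic-free advantages from grouped rewards.

    group_rewards: one list of scalar rewards per prompt (one entry per
    sampled response). Assumption of this sketch: center on the group
    mean, scale by the standard deviation over the whole batch.
    """
    flat = [r for group in group_rewards for r in group]
    batch_std = pstdev(flat)
    return [
        [(r - mean(group)) / (batch_std + eps) for r in group]
        for group in group_rewards
    ]


def sequence_level_loss(token_losses):
    """Average within each response first, then across responses."""
    return mean(mean(seq) for seq in token_losses)


def token_level_loss(token_losses):
    """Pool every token in the batch and average once, so each token
    carries equal weight regardless of which response it belongs to."""
    flat = [t for seq in token_losses for t in seq]
    return sum(flat) / len(flat)


# Hypothetical per-token losses: one short response, one long response.
losses = [[1.0, 1.0], [0.0] * 6]
seq_loss = sequence_level_loss(losses)   # each response weighs the same
tok_loss = token_level_loss(losses)      # each token weighs the same
```

On this toy batch the sequence-level average is 0.5 while the token-level average is 0.25: under sequence-level aggregation the two-token response counts as much as the six-token one, whereas token-level aggregation weights every token equally.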
That reframing connects to a striking parallel result in the collection. When you compare Expert Iteration, PPO, and other RL variants on reasoning tasks, they perform comparably, because exploration is bounded by the model's pretrained distribution, not by the cleverness of the algorithm (see "Does the choice of RL algorithm actually matter for reasoning?"). The argument there is that RL for reasoning functions more like *selection* than *discovery*: the optimizer is mostly surfacing solutions the base model already latently contains. If that's true, it explains why two small techniques can close the gap: the gap was never as large as the algorithm names suggested, because none of them are inventing new reasoning ability.
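The selection-versus-discovery point can be illustrated with a deliberately idealized toy. The sketch below is not any of the compared algorithms; it assumes a policy that is just a probability table over a handful of hypothetical candidate solutions, and a single filter-and-renormalize step standing in for learning from rewarded samples.

```python
def selection_step(policy, is_correct):
    """One idealized selection step: keep only the probability mass the
    policy already places on correct solutions, then renormalize."""
    kept = {s: p for s, p in policy.items() if is_correct(s) and p > 0}
    total = sum(kept.values())
    if total == 0:
        # No correct solution can ever be sampled, so there is no signal.
        return dict(policy)
    return {s: kept.get(s, 0.0) / total for s in policy}


# Hypothetical base policy over four candidate solutions.
base = {"A": 0.70, "B": 0.25, "C": 0.05, "D": 0.0}
correct = {"C", "D"}

updated = selection_step(base, lambda s: s in correct)
```

Here `updated["C"]` rises from 0.05 to 1.0, while `updated["D"]` stays at 0.0 even though D is correct: a solution the prior never samples cannot be reinforced, which is the sense in which the prior bounds what the optimizer can find.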
There's a useful counterpoint to keep the picture honest. Not every algorithmic choice is cosmetic: some structural changes genuinely add signal the base model can't supply on its own. Tree-GRPO, for instance, uses a branching rollout structure to convert trajectory-level outcome rewards into step-level process supervision, giving it finer-grained credit assignment than flat algorithms can provide (see "Can tree structure alone convert outcome rewards into process supervision?"). Similarly, methods that turn rich environment feedback into dense gradient signals change what the policy can learn from, not just how it's optimized (see "Can environment feedback replace scalar rewards in policy learning?"). So "algorithm choice barely matters" holds for the family of policy-gradient variants competing on the same scalar reward; it's less true once a method changes the *shape* of the reward signal itself.
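A rough sketch of how a branching structure can turn leaf-level outcome rewards into step-level signals is given below. It is an illustration only, not Tree-GRPO's actual procedure: the `Node` class, the use of a subtree's mean leaf reward as its value, and the sibling-mean baseline are all assumptions of this sketch.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class Node:
    name: str
    reward: Optional[float] = None            # outcome reward, leaves only
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node):
    """Collect the outcome rewards of every leaf below this node."""
    if not node.children:
        return [node.reward]
    return [r for child in node.children for r in leaf_rewards(child)]


def subtree_value(node):
    """Assumed value of a step: mean outcome reward over its leaves."""
    return mean(leaf_rewards(node))


def sibling_advantages(parent):
    """Step-level signal: each child's value relative to its siblings."""
    values = {c.name: subtree_value(c) for c in parent.children}
    baseline = mean(list(values.values()))
    return {name: v - baseline for name, v in values.items()}


# Hypothetical tree: two alternative first steps, two rollouts each.
root = Node("root", children=[
    Node("step_a", children=[Node("a1", 1.0), Node("a2", 1.0)]),
    Node("step_b", children=[Node("b1", 0.0), Node("b2", 1.0)]),
])
step_signal = sibling_advantages(root)
```

Only the leaves carry rewards, yet `step_signal` assigns +0.25 to `step_a` and -0.25 to `step_b`, so the first step receives its own credit without a separate process reward model.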
Finally, the collection offers a lens on *why* these methods are so interchangeable in the first place. The work tracing DPO and PPO-Clip back to prospect theory argues they succeed because they implicitly mirror the same structure of human decision-making, namely loss aversion and reference-dependent utility, so different surface formulations end up encoding nearly the same objective (see "Why do alignment methods work if they model human irrationality?"). If the algorithms are all approximating one underlying thing, it stops being surprising that a stripped-down PPO with two well-chosen techniques lands in the same place. The lesson for a practitioner: spend your effort on the prior and the reward signal's structure, not on chasing the latest acronym.
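For readers unfamiliar with the two properties named above, the classic prospect-theory value function shows both in a few lines. This sketch is the textbook Kahneman-Tversky form with their commonly quoted parameter estimates (0.88 and 2.25) as defaults; it is not the objective used by DPO, PPO-Clip, or KTO, and the function name is hypothetical.

```python
def prospect_value(outcome, reference=0.0, alpha=0.88, loss_aversion=2.25):
    """Textbook prospect-theory value function.

    Reference dependence: only the gap between outcome and reference matters.
    Loss aversion: a loss is scaled up by `loss_aversion` relative to an
    equally sized gain.
    """
    z = outcome - reference
    if z >= 0:
        return z ** alpha
    return -loss_aversion * ((-z) ** alpha)


gain = prospect_value(1.0)                        # one unit above reference
loss = prospect_value(-1.0)                       # one unit below reference
same_outcome_as_loss = prospect_value(1.0, reference=2.0)
```

With these defaults a unit gain is valued at 1.0 and a unit loss at -2.25, and the same outcome of 1.0 is valued negatively once the reference point moves to 2.0.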
Sources (5 notes)
Advantage normalization and token-level loss aggregation allow critic-free PPO to surpass more complex algorithms. Systematic evaluation shows most RL techniques are setup-sensitive; the pretrained prior, not algorithm choice, sets the performance ceiling.
Expert Iteration, PPO, and RC-RL perform comparably on reasoning because exploration is constrained by the pretrained distribution, not the optimizer. RL functions as selection, not discovery—the prior contains most solutions the algorithm will find.
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.
KTO formalizes what DPO and PPO-Clip do implicitly: they succeed because they mirror prospect theory's structure of human decision-making. Binary utility signals suffice and outperform pairwise preferences when pretrained models are strong.