How does modified PPO handle samples from much older model versions?
This reads as a question about how PPO copes with 'stale' training data: samples generated by an earlier version of the model that has since drifted away from the policy being updated. This is the off-policy problem that PPO's clipped importance ratio is built to manage.
The honest answer is that the corpus addresses off-policy samples (generations collected from an older model version that no longer matches the current policy) sideways rather than head-on. There isn't a note here dedicated to importance-sampling ratios or staleness windows specifically. But several notes converge on *why* the question matters and what actually governs the answer, which is more useful than a mechanical reply.
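For reference, the mechanism the question is about can be sketched in a few lines. This is the standard PPO-Clip per-sample objective, not something drawn from the notes; the function names and the `eps=0.2` default are illustrative assumptions.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO-Clip objective (to be maximized).

    logp_old is the log-probability under the policy that generated the
    sample; logp_new is the log-probability under the policy being updated.
    """
    ratio = math.exp(logp_new - logp_old)  # importance-sampling ratio
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def is_clipped(logp_new, logp_old, advantage, eps=0.2):
    """True when the clipped term is active, so the sample gives no gradient."""
    ratio = math.exp(logp_new - logp_old)
    return (advantage > 0 and ratio > 1.0 + eps) or (
        advantage < 0 and ratio < 1.0 - eps
    )
```

The older the generating policy, the further the ratio tends to sit from 1. A sample whose probability has doubled under the current policy (ratio 2) and whose advantage is positive is capped at 1.2 times the advantage and contributes no gradient. The same sample with a negative advantage is not clipped and still pushes the policy back.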
The first reframe: in reasoning RL, the optimizer matters far less than people assume. The note 'Does the choice of RL algorithm actually matter for reasoning?' argues that PPO, Expert Iteration, and other variants perform comparably because exploration is bounded by the pretrained prior, not by the update rule: RL acts as *selection* from what the base model already contains, not discovery. If that's true, then how PPO weights stale versus fresh samples is a second-order knob compared to whether the samples fall inside the model's reachable distribution at all. 'Can two simple techniques match complex RL algorithms?' reinforces this from the engineering side: the two things that let critic-free PPO match or surpass fancier algorithms are advantage normalization and token-level loss aggregation, and 'most RL techniques are setup-sensitive', meaning a modification's value depends heavily on context rather than holding universally.
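The two techniques named above can be sketched as follows. The notes do not give formulas, so this is one common reading: advantages are standardized over a batch, and the loss is averaged over all tokens rather than per sequence first. All function names are hypothetical.

```python
from statistics import mean, pstdev

def normalize_advantages(advantages, eps=1e-8):
    """Standardize advantages to zero mean and (roughly) unit variance."""
    mu = mean(advantages)
    sigma = pstdev(advantages)
    return [(a - mu) / (sigma + eps) for a in advantages]

def sequence_level_loss(token_losses):
    """Average within each sequence, then across sequences.

    Every sequence gets equal weight regardless of its length.
    """
    return mean([mean(seq) for seq in token_losses])

def token_level_loss(token_losses):
    """Average over every token in the batch.

    Longer sequences contribute proportionally more to the update.
    """
    flat = [t for seq in token_losses for t in seq]
    return mean(flat)

# Toy batch: a short sequence with low loss, a long one with high loss.
batch = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]
seq_loss = sequence_level_loss(batch)  # (1 + 4) / 2
tok_loss = token_level_loss(batch)     # (2 + 24) / 8
```

On this toy batch the two aggregations give 2.5 and 3.25, which shows how the choice changes the weight given to long generations.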
The more interesting lateral connection is that a sample from an *older model version* is exactly a sample whose informativeness has shifted. 'How does model ability change what samples teach?' shows that a sample's learning value isn't fixed: it depends on the interaction between its difficulty and the model's *current* ability, and the productive band drifts within a few training steps. So an old sample isn't just 'noisier'; it may now be too easy or too hard relative to where the policy has moved, which is a sharper way to think about staleness than raw policy divergence.
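One way to make the drifting band concrete is a toy proxy. Assume binary rewards, and treat the reward variance at the current pass rate, p(1 - p), as a rough stand-in for how much a problem can teach. This proxy is an illustrative assumption, not the measure used in the note.

```python
def signal_proxy(pass_rate):
    """Variance of a binary reward at the given pass rate (assumed proxy)."""
    return pass_rate * (1.0 - pass_rate)

# The same problem, scored against two policy versions (made-up pass rates).
pass_rate_when_sampled = 0.5   # older policy: medium difficulty
pass_rate_now = 0.95           # current policy: nearly solved

value_then = signal_proxy(pass_rate_when_sampled)
value_now = signal_proxy(pass_rate_now)
```

Under this proxy the problem was maximally informative when it was sampled and carries far less signal for the current policy, even though the problem itself has not changed.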
This matters because mishandling such samples isn't neutral; it actively degrades the model. 'Do overly hard RLVR samples actually harm model capabilities?' shows that when group-relative normalization treats rare accidental successes as high-advantage trajectories, the model learns shortcuts (answer repetition, computation-skipping) that contaminate capabilities it already had. Group-relative normalization is therefore a plausible place for a stale, misvalued sample to get amplified into a degenerate update. So the real risk in 'older samples' isn't lost signal; it's reinforced bad signal.
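The amplification is easy to see numerically. The sketch below assumes the usual group-relative form, (reward - group mean) / group standard deviation, with binary rewards and a group size of 16. The group size and function name are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Nearly impossible problem: one accidental success out of 16 attempts.
hard_group = [0.0] * 15 + [1.0]
# Medium problem: half of the attempts succeed.
medium_group = [0.0] * 8 + [1.0] * 8

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)
```

The lone success in the hard group gets an advantage of about 3.87 (the square root of 15), while each success in the medium group gets about 1.0. The rarer the success, the harder the update pushes toward whatever produced it, including a lucky shortcut.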
If you want to go one layer deeper into *why* PPO-style clipping works at all, 'Why do alignment methods work if they model human irrationality?' reframes PPO-Clip as implicitly modeling human loss-aversion, in line with prospect theory: the clip is doing something structurally meaningful, not just bounding variance. The thread running through all of these: with a strong pretrained prior, the question of how to weight old samples collapses into the broader question of which samples are worth trusting at the model's current state, and getting that wrong costs you capabilities you already had.
Sources (5 notes)
Expert Iteration, PPO, and RC-RL perform comparably on reasoning because exploration is constrained by the pretrained distribution, not the optimizer. RL functions as selection, not discovery—the prior contains most solutions the algorithm will find.
Advantage normalization and token-level loss aggregation allow critic-free PPO to surpass more complex algorithms. Systematic evaluation shows most RL techniques are setup-sensitive; the pretrained prior, not algorithm choice, sets the performance ceiling.
A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
KTO formalizes what DPO and PPO-Clip do implicitly: they succeed because they mirror prospect theory's structure of human decision-making. Binary utility signals suffice and outperform pairwise preferences when pretrained models are strong.