Why do six different RLVR algorithms converge on similar performance levels?
This explores why a set of distinct RLVR algorithms (different optimizers, reward schemes, advantage functions) tends to land at roughly the same scores. The corpus's answer is that they all surface what the base model already contains rather than build anything new.
This reads the question as: if six algorithms differ in their machinery, why doesn't that difference show up as different performance? The corpus points to a single uncomfortable answer: RLVR mostly redistributes probability mass over reasoning the base model could already produce, so the pretrained model, not the algorithm, sets the ceiling. The clearest statement of this is that RLVR improves *sampling efficiency*, not capability boundaries. At high k, the base model's pass@k actually matches or beats that of its RLVR-tuned version, meaning the tuning just narrows sampling toward solutions already living in the base distribution (Does RLVR actually expand what models can reason about?). If every algorithm is fishing in the same pond, they converge on the same catch.
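The pass@k argument above can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator (draw k of n samples without replacement, given c correct). The per-problem correct counts for the "base" and "tuned" models are invented for illustration only and are not results from any of the cited notes; the helper name `mean_pass_at_k` is likewise hypothetical.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn
    without replacement from n generations with c correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems (hypothetical helper)."""
    return mean(pass_at_k(n, c, k) for c in correct_counts)


# Illustrative, made-up counts of correct samples out of 256 per problem.
# The "tuned" model is sharper on problems 1-2 but has lost problems 3-4,
# which the "base" model still solves occasionally.
N_SAMPLES = 256
base_counts = [64, 64, 4, 4]
tuned_counts = [230, 230, 0, 0]

for k in (1, 8, 128):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    tuned = mean_pass_at_k(tuned_counts, N_SAMPLES, k)
    print(f"k={k:>3}  base={base:.3f}  tuned={tuned:.3f}")
```

In this toy setup the tuned model wins at k=1 and the base model wins at k=128, which is the crossover pattern the paragraph describes: sharper sampling over a set of solvable problems that has not grown.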
The mechanism behind that convergence becomes concrete at the parameter level. Across seven RL algorithms and ten model families, RL updates only 5–30% of parameters, and strikingly, those sparse updates are nearly full-rank and *nearly identical across random seeds* (Does reinforcement learning update only a small fraction of parameters?). That's structural, not arbitrary: the optimization keeps targeting the same subnetwork regardless of the dice roll. A complementary finding shows RL collapses onto a single dominant format that already existed in pretraining, amplifying one distribution and suppressing the rest within the first epoch (Does RL training collapse format diversity in pretrained models?). Different algorithms, same attractor.
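A minimal sketch of how the three quantities in this paragraph could be measured on one weight matrix: the fraction of entries an update touches, the overlap between the updated entries of two runs, and the rank of the update. The weights here are synthetic NumPy arrays built so that two "seeds" hit almost the same 20% of entries; the function names, the 20% density, and the 2% disagreement rate are assumptions for illustration, not values or code from the cited note.

```python
import numpy as np


def update_mask(base: np.ndarray, tuned: np.ndarray, tol: float = 0.0) -> np.ndarray:
    """Boolean mask of entries whose value changed by more than tol."""
    return np.abs(tuned - base) > tol


def update_fraction(mask: np.ndarray) -> float:
    """Share of parameters that were updated."""
    return float(mask.mean())


def mask_overlap(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of the entries updated in run a that run b also updated."""
    updated_in_a = a.sum()
    return float((a & b).sum() / updated_in_a) if updated_in_a else 0.0


rng = np.random.default_rng(0)
shape = (64, 64)
base = rng.normal(size=shape)

# Synthetic assumption: both runs update roughly the same 20% subnetwork,
# with about 2% of entries flipped in or out for the second run.
subnetwork_a = rng.random(shape) < 0.2
subnetwork_b = subnetwork_a ^ (rng.random(shape) < 0.02)

tuned_a = base + subnetwork_a * rng.normal(scale=0.01, size=shape)
tuned_b = base + subnetwork_b * rng.normal(scale=0.01, size=shape)

mask_a = update_mask(base, tuned_a)
mask_b = update_mask(base, tuned_b)

print("updated fraction (run a):", update_fraction(mask_a))
print("overlap of a's updates with b's:", mask_overlap(mask_a, mask_b))
# A sparse update is not the same thing as a low-rank update.
print("rank of update a:", np.linalg.matrix_rank(tuned_a - base), "of", shape[0])
```

The last line is the reason the "nearly full-rank" detail matters: an update can touch few entries and still span almost every direction of the matrix, so sparsity in this sense is a different property from the low-rank structure assumed by adapter methods.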
The spurious-reward result makes the point almost paradoxically: Qwen2.5-Math gains 16–25% on MATH-500 from *random or even incorrect* rewards, because the reward isn't teaching anything. It's activating latent code-reasoning behavior baked in during pretraining, and Llama and OLMo, which lack that pretraining, get nothing (Why do random rewards improve reasoning for some models but not others?). When the reward signal can be near-noise and still work, the algorithm's design clearly isn't the load-bearing variable. The pretraining distribution is.
There's a sharper edge here worth knowing: some of that convergent 'improvement' may not be reasoning at all. RLVR raises trace *coherence*, meaning fewer logical breaks between adjacent steps, without guaranteeing the proof is globally valid (Does RLVR actually improve mathematical reasoning or just coherence?), and a chunk of headline benchmark gains turns out to be memorization on contaminated datasets rather than genuine reasoning (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). If algorithms converge partly because they're all converging on the same memorized or merely-coherent surface, the plateau is even less about the algorithm.
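The contamination point can be illustrated with a sketch of a partial-prompt reconstruction check: show the model the first part of a benchmark problem and see whether it reproduces the rest verbatim. Everything here is an assumption for illustration: the 60% cut point, the exact-match criterion, the `complete` callable standing in for a model, and the toy problems. It is not the procedure or code of the cited note.

```python
from typing import Callable, Iterable


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],
    keep: float = 0.6,
) -> float:
    """Share of problems whose held-back tail the model reproduces exactly
    when given only the first `keep` fraction of the text."""
    hits = 0
    total = 0
    for text in problems:
        cut = int(len(text) * keep)
        prefix, remainder = text[:cut], text[cut:]
        if complete(prefix).strip() == remainder.strip():
            hits += 1
        total += 1
    return hits / total if total else 0.0


# Toy stand-in for a contaminated model: it has "memorized" these problems
# and continues any prefix of them verbatim, returning nothing otherwise.
memorized = [
    "Find all real x such that x^2 - 5x + 6 = 0.",
    "How many positive divisors does 360 have?",
]


def toy_complete(prefix: str) -> str:
    for text in memorized:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ""


# Hypothetical "post-release" problems the toy model has never seen.
fresh = [
    "Compute the sum of the first 40 odd positive integers.",
    "What is the remainder when 7^100 is divided by 13?",
]

print("seen benchmark:", reconstruction_rate(memorized, toy_complete))
print("fresh benchmark:", reconstruction_rate(fresh, toy_complete))
```

A high rate on an older benchmark combined with a near-zero rate on problems released after the model is the signature the paragraph attributes to contamination rather than reasoning.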
The interesting corollary, the thing you might not have known you wanted, is what *does* break the plateau. The methods that escape it don't tweak the RL objective; they change what's in the pond. Distillation genuinely transfers new reasoning patterns the base model lacked (Does RLVR actually expand what models can reason about?); running supervised imitation *first* to seed new rollouts and then sharpening with RLVR beats either alone (Does sequencing imitation then exploration training improve reasoning?); and injecting external data with exploration-rewarding advantage functions counteracts the 'capability boundary collapse' that on-policy RLVR otherwise causes (Why does RLVR training narrow a model's problem solving ability?). The pattern across all of them: you move the ceiling by adding new material, not by redesigning the reward. Six algorithms with no new material converge because there's nothing new to converge toward.
Sources (8 notes)
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Qwen2.5-Math gains 16–25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.