Reinforcement Learning for LLMs

Does RL training follow predictable scaling curves?

Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.

Note · 2026-02-23 · sourced from Inference time scaling

The first large-scale systematic study of RL scaling for LLMs (400K+ GPU-hours, 200+ models) establishes that RL training follows sigmoidal compute-performance curves. This is the RL equivalent of Chinchilla-style scaling laws for pretraining: given enough data points, you can predict where a training run will plateau before spending the full compute budget.
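
A minimal sketch of what such a curve looks like, assuming one common saturating-sigmoid parameterization (asymptote A, half-saturation compute C_mid, steepness B); the paper's exact equation may differ:

```python
import numpy as np

def sigmoid_perf(compute, A, C_mid, B):
    """Saturating compute-performance curve (assumed form, not necessarily the paper's).

    A      -- asymptotic performance ceiling: where the run plateaus
    C_mid  -- compute at which performance reaches half of A
    B      -- steepness: how sharply the curve rises toward its ceiling
    """
    return A / (1.0 + (C_mid / compute) ** B)
```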

The study's central finding splits RL design choices into two tiers:

Asymptote-setting choices: these determine the performance ceiling. Not all RL recipes converge to the same asymptotic performance; each combination of reward design, data composition, and training structure fixes its own ceiling. Small-scale experiments run with the wrong recipe will predict the wrong ceiling.

Efficiency-modulating details: loss aggregation method, normalization scheme, curriculum design, and off-policy algorithm primarily affect how quickly the model reaches its asymptote, not where that asymptote sits. These are "how fast" knobs, not "how good" knobs, as the sketch below illustrates.
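
In terms of the sigmoid sketch above, the two tiers map onto different parameters: asymptote-setting choices move A, while efficiency-modulating details move C_mid and B. A hypothetical illustration (all numbers invented):

```python
import numpy as np

def sigmoid_perf(compute, A, C_mid, B):
    # Assumed saturating form: A = ceiling, C_mid = half-saturation compute, B = steepness.
    return A / (1.0 + (C_mid / compute) ** B)

compute = np.logspace(2, 6, num=5)  # 1e2 .. 1e6 GPU-hours, purely illustrative

fast = sigmoid_perf(compute, A=0.80, C_mid=1e3, B=1.0)    # efficient implementation details
slow = sigmoid_perf(compute, A=0.80, C_mid=1e4, B=1.0)    # inefficient details, same ceiling
capped = sigmoid_perf(compute, A=0.65, C_mid=1e3, B=1.0)  # wrong recipe: a lower ceiling

# fast and slow converge to the same 0.80 at large compute; capped never gets there.
for name, curve in [("fast", fast), ("slow", slow), ("capped", capped)]:
    print(name, np.round(curve, 3))
```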

The practical value: stable, scalable recipes follow predictable trajectories that enable reliable extrapolation from smaller runs. This means researchers can evaluate whether a recipe is promising by running small-scale experiments and fitting the sigmoid, rather than committing to full-scale training. The ScaleRL "best-practice recipe" was validated by successfully predicting performance on a single 100K GPU-hour run.
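
A sketch of that workflow, fitting the assumed sigmoid to cheap early measurements and reading off the predicted ceiling (the data and parameter values are hypothetical, not ScaleRL's):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_perf(compute, A, C_mid, B):
    # Assumed saturating form: A = ceiling, C_mid = half-saturation compute, B = steepness.
    return A / (1.0 + (C_mid / compute) ** B)

# Hypothetical eval scores logged over the cheap, early part of a run.
compute = np.array([50.0, 100.0, 200.0, 400.0, 800.0, 1600.0])  # GPU-hours
scores = np.array([0.11, 0.19, 0.30, 0.43, 0.55, 0.63])

# Fit on the small-scale data only; A_hat is the forecast plateau.
(A_hat, C_mid_hat, B_hat), _ = curve_fit(
    sigmoid_perf, compute, scores,
    p0=[0.8, 500.0, 1.0],
    bounds=(0.0, [1.0, np.inf, np.inf]),
)
print(f"forecast plateau: {A_hat:.2f} (half-saturation near {C_mid_hat:.0f} GPU-hours)")
```

If one candidate recipe's fitted A_hat sits well below another's, it can be dropped before full-scale training is ever committed.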

This refines Does the choice of RL algorithm actually matter for reasoning?: at the algorithm level (PPO vs. Expert Iteration vs. RC-RL), the options are largely interchangeable. But at the recipe level, which includes data, reward structure, and training configuration, the choice sets the asymptote. The algorithm-interchangeability finding operates within a recipe; recipe selection sets the ceiling that every algorithm inside it approaches.

The sigmoid framework also provides the mathematical structure for Does policy entropy collapse limit reasoning performance in RL?: entropy collapse IS the approach to sigmoid saturation. The sigmoid curve predicts when collapse will occur, making the previously unpredictable bottleneck forecastable.


Source: Inference time scaling
