Reinforcement Learning for LLMs

Does RL training follow predictable scaling curves?

Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.

Note · 2026-02-23 · sourced from Inference time scaling

The first large-scale systematic study of RL scaling for LLMs (400K+ GPU-hours, 200+ models) establishes that RL training follows sigmoidal compute-performance curves. This is the RL equivalent of Chinchilla-style scaling laws for pretraining: given enough data points, you can predict where a training run will plateau before spending the full compute budget.
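
A minimal sketch of what such a curve looks like, assuming one common saturating-sigmoid parameterization (asymptote A, half-saturation compute C_mid, steepness B); the paper's exact equation may differ:

```python
import numpy as np

def sigmoid_perf(compute, A, C_mid, B):
    """Saturating compute-performance curve (assumed form, not necessarily the paper's).

    A      -- asymptotic performance ceiling: where the run plateaus
    C_mid  -- compute at which performance reaches half of A
    B      -- steepness: how sharply the curve rises toward its ceiling
    """
    return A / (1.0 + (C_mid / compute) ** B)
```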

The study's central finding splits RL design choices into two tiers:

Asymptote-setting choices: these determine the performance ceiling. Not all RL recipes converge to the same asymptotic performance; each combination of reward design, data composition, and training structure fixes its own ceiling. Small-scale experiments run with the wrong recipe will predict the wrong ceiling.

Efficiency-modulating details: loss aggregation method, normalization scheme, curriculum design, and off-policy algorithm primarily affect how quickly the model reaches its asymptote, not where that asymptote sits. These are "how fast" knobs, not "how good" knobs, as the sketch below illustrates.
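
In terms of the sigmoid sketch above, the two tiers map onto different parameters: asymptote-setting choices move A, while efficiency-modulating details move C_mid and B. A hypothetical illustration (all numbers invented):

```python
import numpy as np

def sigmoid_perf(compute, A, C_mid, B):
    # Assumed saturating form: A = ceiling, C_mid = half-saturation compute, B = steepness.
    return A / (1.0 + (C_mid / compute) ** B)

compute = np.logspace(2, 6, num=5)  # 1e2 .. 1e6 GPU-hours, purely illustrative

fast = sigmoid_perf(compute, A=0.80, C_mid=1e3, B=1.0)    # efficient implementation details
slow = sigmoid_perf(compute, A=0.80, C_mid=1e4, B=1.0)    # inefficient details, same ceiling
capped = sigmoid_perf(compute, A=0.65, C_mid=1e3, B=1.0)  # wrong recipe: a lower ceiling

# fast and slow converge to the same 0.80 at large compute; capped never gets there.
for name, curve in [("fast", fast), ("slow", slow), ("capped", capped)]:
    print(name, np.round(curve, 3))
```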

The practical value: stable, scalable recipes follow predictable trajectories that enable reliable extrapolation from smaller runs. This means researchers can evaluate whether a recipe is promising by running small-scale experiments and fitting the sigmoid, rather than committing to full-scale training. The ScaleRL "best-practice recipe" was validated by successfully predicting performance on a single 100K GPU-hour run.
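
A sketch of that workflow, fitting the assumed sigmoid to cheap early measurements and reading off the predicted ceiling (the data and parameter values are hypothetical, not ScaleRL's):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_perf(compute, A, C_mid, B):
    # Assumed saturating form: A = ceiling, C_mid = half-saturation compute, B = steepness.
    return A / (1.0 + (C_mid / compute) ** B)

# Hypothetical eval scores logged over the cheap, early part of a run.
compute = np.array([50.0, 100.0, 200.0, 400.0, 800.0, 1600.0])  # GPU-hours
scores = np.array([0.11, 0.19, 0.30, 0.43, 0.55, 0.63])

# Fit on the small-scale data only; A_hat is the forecast plateau.
(A_hat, C_mid_hat, B_hat), _ = curve_fit(
    sigmoid_perf, compute, scores,
    p0=[0.8, 500.0, 1.0],
    bounds=(0.0, [1.0, np.inf, np.inf]),
)
print(f"forecast plateau: {A_hat:.2f} (half-saturation near {C_mid_hat:.0f} GPU-hours)")
```

If one candidate recipe's fitted A_hat sits well below another's, it can be dropped before full-scale training is ever committed.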

This refines Does the choice of RL algorithm actually matter for reasoning?: at the algorithm level (PPO vs. Expert Iteration vs. RC-RL), the options are largely interchangeable. But at the recipe level, which includes data, reward structure, and training configuration, the choice sets the asymptote. The algorithm-interchangeability finding operates within a recipe; recipe selection sets the ceiling that every algorithm inside it approaches.

The sigmoid framework also provides the mathematical structure for Does policy entropy collapse limit reasoning performance in RL?: entropy collapse IS the approach to sigmoid saturation. The sigmoid curve predicts when collapse will occur, making the previously unpredictable bottleneck forecastable.


Source: Inference time scaling
