Why does early experience provide better warm-starts for downstream reinforcement learning?
This explores why feeding a model experience early — before the main reinforcement-learning push — makes that later RL more effective. The collection doesn't use the phrase 'warm-start' directly, but several notes converge on a single explanation: RL is largely an *activation and sharpening* process, not a *creation* process, so whatever you plant early determines the ceiling of what RL can reach.
The sharpest version of this comes from work on what reinforcement learning actually does. RLVR (reinforcement learning with verifiable rewards) turns out not to expand the set of problems a model can solve: at high sampling counts (pass@k with large k) the base model matches or beats the RL-tuned one, meaning RL narrows sampling toward solutions the base distribution *already contained* (see: Does RLVR actually expand what models can reason about?). A companion finding shows a single training example can trigger this activation, and even spurious rewards work nearly as well, as long as the right capability was already instilled during pretraining (see: What does reward learning actually do to model reasoning?). If RL can only amplify what's latent, then the quality of the early experience *is* the quality of the warm-start: you're not teaching new skills downstream, you're surfacing ones already there.
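The pass@k comparison behind this claim can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts below are invented purely for illustration and are not data from the cited notes; they only show how a model that samples narrowly can win at k = 1 and still lose at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical numbers (assumption, not from the source): 100 samples per problem.
N = 100
base_counts = [5, 5, 5, 5]       # base model: solves every problem, but rarely
tuned_counts = [60, 60, 60, 0]   # RL-tuned model: solves three often, one never

for k in (1, 64):
    base = mean_pass_at_k(base_counts, N, k)
    tuned = mean_pass_at_k(tuned_counts, N, k)
    print(f"k={k:>2}  base={base:.3f}  tuned={tuned:.3f}")
```

In this toy setup the tuned model leads at k = 1 because its probability mass is concentrated on problems it solves reliably, while the base model leads at k = 64 because it still has nonzero mass on the fourth problem. That is the shape of the "narrowing, not expanding" pattern described above.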
The order in which capability arrives also matters. RL training moves through two phases: first consolidating procedural execution correctness, then shifting the bottleneck to strategic planning (see: Does RL training follow a predictable two-phase learning sequence?). That sequence implies a warm-start that has already nailed the procedural layer lets RL spend its budget on the harder strategic phase instead of relearning the basics. Relatedly, work on chain-of-thought shows reasoning can be 'planted earlier', treated as exploratory action *during pretraining* with an information-gain reward, and this lifts downstream math and science performance by ~19% (see: Can chain-of-thought reasoning be learned during pretraining itself?). Earlier planting, better downstream behavior.
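The information-gain reward mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the reward is simply the improvement in log-likelihood of the observed next token when the model conditions on a sampled thought versus when it does not. The function name and the probabilities are hypothetical, and the actual method may compute its no-thought baseline differently.

```python
import math


def information_gain_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood improvement on the true next token (hypothetical helper).

    p_with_thought:    probability of the observed token given context + thought
    p_without_thought: probability of the observed token given context alone
    """
    if not (0.0 < p_with_thought <= 1.0 and 0.0 < p_without_thought <= 1.0):
        raise ValueError("probabilities must be in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)


# Illustrative values only (assumption): a helpful thought doubles the
# probability of the correct token, an unhelpful one halves it.
helpful = information_gain_reward(0.6, 0.3)
unhelpful = information_gain_reward(0.2, 0.4)
print(f"helpful thought reward:   {helpful:+.3f}")
print(f"unhelpful thought reward: {unhelpful:+.3f}")
```

The point of the design is that no external verifier is needed: the reward is positive exactly when the thought makes the real continuation more predictable, so it can be computed on ordinary pretraining text.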
There's a structural reason this works so cleanly. RL touches surprisingly little of the network: across seven algorithms and ten model families it updates only 5–30% of parameters, and those updates are nearly identical across random seeds, which makes them structural rather than arbitrary (see: Does reinforcement learning update only a small fraction of parameters?). A process that adjusts a small, consistent subnetwork is well-suited to *refining* a good starting point and poorly suited to building capability from scratch, which is exactly why the starting point carries so much weight.
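One way to make the "small, consistent subnetwork" claim measurable is to compare checkpoints directly. The sketch below is an assumption about how such a measurement could be done, not the cited work's procedure: it treats weights as flat lists of floats, counts the fraction that changed between a base and a tuned checkpoint, and uses Jaccard overlap of the changed positions as a stand-in for cross-seed consistency. All names and values are hypothetical.

```python
def changed_indices(base, tuned, tol: float = 0.0) -> set:
    """Positions where the tuned weight differs from the base weight by more than tol."""
    if len(base) != len(tuned):
        raise ValueError("checkpoints must have the same number of parameters")
    return {i for i, (b, t) in enumerate(zip(base, tuned)) if abs(t - b) > tol}


def update_fraction(base, tuned, tol: float = 0.0) -> float:
    """Fraction of parameters that RL actually moved."""
    return len(changed_indices(base, tuned, tol)) / len(base)


def jaccard_overlap(a: set, b: set) -> float:
    """Overlap of two updated-parameter sets (1.0 means identical subnetworks)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy checkpoints (assumption): 10 parameters, two RL runs with different seeds.
base = [0.5] * 10
seed_a = list(base)
seed_a[2] += 0.01
seed_a[7] -= 0.02
seed_b = list(base)
seed_b[2] -= 0.03
seed_b[7] += 0.01

frac_a = update_fraction(base, seed_a)
overlap = jaccard_overlap(changed_indices(base, seed_a), changed_indices(base, seed_b))
print(f"fraction updated (seed a): {frac_a:.0%}")
print(f"cross-seed overlap:        {overlap:.2f}")
```

In this toy example both runs move the same two positions by different amounts, so the update fraction is small and the overlap is complete. A real measurement would iterate over a model's parameter tensors and would likely need a nonzero `tol` to ignore numerical noise.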
The lateral payoff for a curious reader: 'warm-start' isn't just an efficiency trick. Two different lines of work, one showing RL doesn't push past base-model boundaries, another showing knowledge embeds better when reasoning quality is internalized first (see: Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?), point to the same uncomfortable idea: the most important learning may happen *before* the reinforcement learning ever starts, and RL is mostly there to find what early experience already put in place.
Sources (6 notes)
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.