INQUIRING LINE

Why does early experience provide better warm-starts for downstream reinforcement learning?

This explores why building up experience or capability *before* the reinforcement-learning phase — a 'warm start' — pays off downstream, and the corpus suggests the answer is that RL mostly sharpens and activates what's already present rather than creating it.


The collection doesn't use the phrase 'warm-start' directly, but several notes converge on a single explanation for why experience supplied before the main reinforcement-learning push makes that later RL more effective: RL is largely an *activation and sharpening* process, not a *creation* process, so whatever you plant early determines the ceiling of what RL can reach.

The sharpest version of this comes from work on what reinforcement learning actually does. RLVR (reinforcement learning with verifiable rewards) turns out not to expand the set of problems a model can solve: in pass@k analysis at high sampling counts, the base model matches or beats the RL-tuned one, meaning RL narrows sampling toward solutions the base distribution *already contained* (see 'Does RLVR actually expand what models can reason about?'). A companion finding shows a single training example can trigger this activation, and even spurious rewards work nearly as well as correct ones, as long as the right capability was pretrained in (see 'What does reward learning actually do to model reasoning?'). If RL can only amplify what's latent, then the quality of the early experience *is* the quality of the warm-start: you're not teaching new skills downstream, you're surfacing ones already there.
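The pass@k comparison behind this claim can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts below are hypothetical numbers chosen only to illustrate the pattern; they are not results from any of the cited notes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations with c correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k wrong samples, so any k-subset has a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of n = 100 samples on four problems.
N_SAMPLES = 100
base_counts = [30, 10, 2, 1]  # diffuse: every problem is solved at least occasionally
rl_counts = [90, 60, 0, 0]    # sharpened: two problems solved often, two never

for k in (1, 64):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    tuned = mean_pass_at_k(rl_counts, N_SAMPLES, k)
    print(f"k={k:>2}  base={base:.3f}  rl={tuned:.3f}")
```

With these illustrative counts the RL-tuned model wins at k=1 but loses at k=64: sharpening raises the odds of a correct first sample while leaving the set of solvable problems no larger, and here smaller.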

The order in which capability arrives also matters. RL training moves through two phases: it first consolidates correct procedural execution, then the bottleneck shifts to strategic planning (see 'Does RL training follow a predictable two-phase learning sequence?'). That sequence implies a warm-start that has already nailed the procedural layer lets RL spend its budget on the harder strategic phase instead of relearning the basics. Relatedly, work on chain-of-thought shows reasoning can be 'planted earlier': it is treated as exploratory action *during pretraining*, rewarded by information gain (the improvement in log-likelihood), and this lifts downstream math and science performance by ~19% (see 'Can chain-of-thought reasoning be learned during pretraining itself?'). Earlier planting, better downstream behavior.
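The information-gain reward mentioned above can be sketched in a few lines. This is one plausible reading of "log-likelihood improvement as a verifier-free reward", not the RLP authors' implementation: the function name, the per-token log-probability inputs, and the example values are all assumptions made for illustration.

```python
def information_gain_reward(
    logp_with_thought: list[float],
    logp_without_thought: list[float],
) -> float:
    """Reward a sampled chain of thought by how much it raises the
    log-likelihood of the observed next tokens.

    Each list holds per-token log-probabilities of the same target tokens:
    one conditioned on context plus the thought, one on context alone.
    (Hypothetical interface; a real setup would read these from a model.)
    """
    if len(logp_with_thought) != len(logp_without_thought):
        raise ValueError("both inputs must score the same target tokens")
    return sum(w - wo for w, wo in zip(logp_with_thought, logp_without_thought))


# Illustrative values only: a thought that helps prediction earns a positive
# reward, one that hurts earns a negative reward. No answer checker is needed.
baseline = [-1.0, -1.5]
helpful = information_gain_reward([-0.2, -0.5], baseline)
unhelpful = information_gain_reward([-1.2, -1.5], baseline)
print(f"helpful={helpful:.2f}  unhelpful={unhelpful:.2f}")
```

The design point is that the reward depends only on the model's own likelihoods over ordinary text, which is what would allow it to be applied during pretraining rather than only on tasks with verifiable answers.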

There's a structural reason this works so cleanly. RL touches surprisingly little of the network: across seven algorithms and ten model families it updates only 5–30% of parameters, with no explicit sparsity regularization, and the set of updated parameters is nearly identical across random seeds, which makes the selection structural rather than arbitrary (see 'Does reinforcement learning update only a small fraction of parameters?'). A process that adjusts a small, consistent subnetwork is well-suited to *refining* a good starting point and poorly suited to building capability from scratch, which is exactly why the starting point carries so much weight.
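The two measurements in this paragraph, the fraction of parameters updated and the agreement between seeds, can be sketched as follows. This is a minimal illustration assuming checkpoints are available as flat lists of floats keyed by parameter name. The helper names, the toy weights, and the use of Jaccard overlap as the agreement measure are assumptions, not the cited paper's procedure.

```python
Params = dict[str, list[float]]


def changed_entries(before: Params, after: Params, tol: float = 0.0) -> set[tuple[str, int]]:
    """Return (parameter name, index) pairs whose value moved by more than tol."""
    changed: set[tuple[str, int]] = set()
    for name, old in before.items():
        new = after[name]
        if len(new) != len(old):
            raise ValueError(f"shape mismatch in {name}")
        changed.update(
            (name, i) for i, (a, b) in enumerate(zip(old, new)) if abs(a - b) > tol
        )
    return changed


def update_fraction(before: Params, after: Params, tol: float = 0.0) -> float:
    """Fraction of all scalar parameters that changed during training."""
    total = sum(len(values) for values in before.values())
    return len(changed_entries(before, after, tol)) / total


def jaccard(a: set, b: set) -> float:
    """Overlap between two sets of updated parameters (1.0 means identical)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


# Toy checkpoints (hypothetical values): one base model, two RL runs with different seeds.
base = {"attn": [0.1, 0.2, 0.3, 0.4], "mlp": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
seed_a = {"attn": [0.1, 0.25, 0.3, 0.4], "mlp": [1.0, 2.0, 3.0, 4.5, 5.0, 6.0]}
seed_b = {"attn": [0.1, 0.15, 0.3, 0.4], "mlp": [1.0, 2.0, 3.0, 3.5, 5.0, 6.5]}

print("fraction updated, seed A:", update_fraction(base, seed_a))
print("fraction updated, seed B:", update_fraction(base, seed_b))
print("overlap across seeds:", jaccard(changed_entries(base, seed_a), changed_entries(base, seed_b)))
```

A low update fraction combined with a high overlap across seeds is the pattern the paragraph describes: the runs change few weights, and they largely change the same ones.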

The lateral payoff for a curious reader: 'warm-start' isn't just an efficiency trick. Two different lines of work point to the same uncomfortable idea. One shows RL doesn't push past base-model boundaries; the other shows knowledge embeds better when reasoning quality is internalized first (see 'Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?'). Together they suggest the most important learning may happen *before* the reinforcement learning ever starts, and RL is mostly there to find what early experience already put in place.


Sources (6 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about warm-start dynamics in RL for language models. The question remains open: *Why does early experience provide better warm-starts for downstream reinforcement learning?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat these as anchors to verify, not settled fact.
- RL does not expand reasoning beyond the base model's latent capabilities; it *activates and sharpens* what pretraining already embedded (~2025).
- A single training example can trigger capability activation; spurious rewards nearly match structured ones if the skill was pretrained (~2025).
- RL training unfolds in two phases: procedural consolidation, then strategic planning; early warm-starts that nail procedure let downstream RL focus on strategy (~2025).
- Chain-of-thought planted as exploratory action during pretraining (with information-gain reward) lifts downstream math/science by ~19% (~2025).
- RL updates only 5–30% of parameters across ten model families; updates are nearly identical across seeds, indicating sparse, structural refinement rather than broad learning (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2507.14843 *The Invisible Leash: Why RLVR May Not Escape Its Origin* (2025-07)
- arXiv:2505.11711 *Reinforcement Learning Finetunes Small Subnetworks in Large Language Models* (2025-05)
- arXiv:2509.20162 *Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation* (2025-09)
- arXiv:2510.01265 *RLP: Reinforcement as a Pretraining Objective* (2025-10)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above—latent-only activation, two-phase RL dynamics, sparse parameter updates—ask whether newer models (o1, r1 variants, reasoning-focused architectures), training methods (mixture-of-experts RL, curriculum-scheduled rewards), tooling (long-context memory, retrieval-augmented RL), or evaluation benchmarks have since *relaxed or overturned* the claim. Separate the durable core question (why early quality matters) from the perishable limitation (what RL can or cannot do). Cite what broke the constraint, if anything; flag where it still holds.
(2) **Surface contradicting or superseding work.** In the last ~6 months, have papers emerged showing RL *does* push past base boundaries, or that late-stage training recovers from poor warm-starts? Highlight the sharpest tension.
(3) **Propose two research questions that assume the regime has moved:** one probing whether *orchestration* (multi-turn RL, meta-learned schedulers, hierarchical rewards) can decouple downstream RL from early experience, and one asking whether *architecture* changes (e.g., dynamic routing, gating) alter which parameters RL touches and thus which early signals matter.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
