Which recipe choices determine the asymptotic ceiling in RL training?

This explores what the corpus means by the 'asymptotic ceiling' of RL training — the upper bound a run converges toward — and which design decisions (reward shape, data difficulty, task ordering, format) set that ceiling versus merely affecting how fast you reach it.

The question here is what fixes the upper bound of an RL run, not how quickly the run gets there. The anchor finding is that RL performance scales sigmoidally, and in a 400K-GPU-hour study the *recipe* sets the asymptote while implementation details only move efficiency, so you can extrapolate a small run's ceiling once the recipe is stable (Does RL training follow predictable scaling curves?). So the real question becomes: which ingredients in the recipe are ceiling-setting? The corpus points at four (reward shape, data difficulty, task ordering, and format), and they're more about reward design and data than about the optimizer.
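
To make the ceiling-versus-efficiency distinction concrete, here is a minimal sketch of a saturating curve. The functional form and the parameter names (`ceiling`, `floor`, `midpoint`, `steepness`) are assumptions chosen for illustration; the notes only say the curve is sigmoidal and do not give its equation.

```python
def sigmoid_reward(compute, ceiling, floor=0.0, midpoint=1e3, steepness=1.0):
    """Assumed saturating curve: reward climbs from `floor` toward `ceiling`.

    `midpoint` is the compute at which half of the total gain is reached,
    and `steepness` controls how sharply the curve bends. Both are
    efficiency knobs; neither moves the asymptote.
    """
    return floor + (ceiling - floor) / (1.0 + (midpoint / compute) ** steepness)


budgets = (1e3, 1e5, 1e9)  # hypothetical compute budgets, arbitrary units

# Same recipe (same ceiling), different implementation efficiency.
slow = [sigmoid_reward(c, ceiling=0.6, midpoint=1e4) for c in budgets]
fast = [sigmoid_reward(c, ceiling=0.6, midpoint=1e3) for c in budgets]

# A different recipe: same efficiency as `fast`, higher ceiling.
better_recipe = [sigmoid_reward(c, ceiling=0.8, midpoint=1e3) for c in budgets]
```

Under these assumptions, `slow` and `fast` differ a lot at small budgets but converge to the same value at large ones, while `better_recipe` ends up higher no matter how much compute the other two receive.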

Reward shape is the first lever. Binary correctness rewards quietly cap calibration: they pay off confident guessing and never punish confident wrong answers. Adding a proper scoring rule (the Brier score) as a second reward term jointly optimizes accuracy and calibration with no trade-off, raising the ceiling rather than the speed (Does binary reward training hurt model calibration?). On unverifiable tasks, the choice of *which statistic* becomes the reward matters too: cross-rollout variance can serve as both a dense token-level signal and a query filter, buying stability that lets training climb further (Can one statistical measure serve dual purposes in RL training?). And where you have no human labels at all, tree-search outcomes can manufacture process-level reward, changing where the reward signal, and with it the ceiling, comes from (Can tree search replace human feedback in LLM training?).
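
A minimal sketch of the difference between a binary reward and a Brier-augmented one. It assumes the model reports a confidence in [0, 1] alongside its answer and that the two terms are combined with equal weight; the function name and the weighting are illustrative, not taken from the notes.

```python
def calibrated_reward(is_correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence.

    Assumption: equal weighting of the two terms. A purely binary reward
    would return 1.0 or 0.0 and ignore `confidence` entirely.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty


confident_right = calibrated_reward(True, 1.0)    # 1.0
confident_wrong = calibrated_reward(False, 1.0)   # -1.0
hedged_wrong = calibrated_reward(False, 0.0)      # 0.0
unsure_right = calibrated_reward(True, 0.5)       # 0.75
```

Under the binary reward, both wrong answers would score 0.0, so confident guessing costs nothing. Here the confident wrong answer is the worst outcome, which is the penalty the binary reward lacks.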

Data difficulty is the second, and it's a trap. Training on nearly-impossible problems doesn't just stall; it actively lowers the ceiling, because group-relative normalization treats rare accidental successes as high-advantage and reinforces shortcuts (answer repetition, skipped computation) that then contaminate skills the model already had (Do overly hard RLVR samples actually harm model capabilities?). Task *ordering* is a quieter knob with the same flavor: structured domains shrink output entropy while creative ones expand it, so scheduling structured tasks first (BWT-guided) prevents entropy collapse from damaging open-ended capability, yielding a 6.2% gain over joint training (Does training order reshape how models handle different task types?). The recurring villain is entropy collapse: RL squeezes behavioral diversity in search agents exactly as it does in reasoning, and SFT on diverse demonstrations is what preserves the exploration breadth a higher ceiling depends on (Does reinforcement learning squeeze exploration diversity in search agents?).
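
The arithmetic behind the hard-sample trap can be sketched in a few lines. This assumes a GRPO-style normalization (subtract the group mean, divide by the group standard deviation) and binary rewards; the group sizes are made up for illustration.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: population standard deviation plus a small `eps` for
    numerical safety, as in common GRPO-style implementations.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Nearly impossible prompt: 1 accidental success in 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Well-matched prompt: 8 successes in 16 rollouts.
balanced_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
balanced_adv = group_relative_advantages(balanced_group)

lucky_success = hard_adv[0]         # about 3.87, i.e. sqrt(15)
ordinary_success = balanced_adv[0]  # about 1.0
```

With binary rewards and success rate p, a success gets advantage sqrt((1 - p) / p), so the rarer the success, the harder it is pushed. If that one success came from a shortcut, the shortcut is what gets reinforced.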

Here's the less obvious part: the ceiling may be set *before* RL even starts. Several notes argue RL doesn't install new capability; it surfaces latent capability. RL teaches a model *when* to reason, not *how*: hybrid models recover 91% of the gains by routing tokens alone, and reasoning activation vectors exist before any RL (Does RL post-training create reasoning or just deploy it?). Out-of-distribution N-1 tests show even GRPO-trained models sharpen memorized templates rather than learning genuine procedures (Do fine-tuned language models actually learn optimization procedures?). And RL collapses onto a single dominant *pretraining* format within the first epoch, with the winner determined by model scale, not necessarily by performance (Does RL training collapse format diversity in pretrained models?). If that's right, two of your biggest ceiling-setting 'recipe' choices are the base model and its pretraining distribution, and RL mostly decides which latent capability gets amplified.

One more structural clue: RL touches only 5–30% of parameters, in sparse-but-full-rank subnetworks that are nearly identical across seeds (Does reinforcement learning update only a small fraction of parameters?), and training itself runs in two phases, execution correctness first and strategic planning second (Does RL training follow a predictable two-phase learning sequence?). Together those say the ceiling lives in a small, structured slice of the model, reached in a predictable order, which is why recipe choices that protect diversity and shape reward in the right phase determine where the sigmoid flattens out.
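
For a sense of what "touches only 5–30% of parameters" measures, here is a minimal sketch that compares a checkpoint before and after RL. It assumes the weights are available as flat lists of floats and that "touched" means the value changed by more than a tolerance; the helper name and the toy values are hypothetical.

```python
def updated_fraction(before, after, tol=0.0):
    """Fraction of parameters whose value moved by more than `tol`."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Toy 10-parameter "model": RL nudges two weights and leaves the rest alone.
base = [0.10, -0.20, 0.30, 0.05, -0.40, 0.25, 0.00, 0.15, -0.10, 0.35]
tuned = list(base)
tuned[2] += 0.01
tuned[7] -= 0.02

fraction = updated_fraction(base, tuned)  # 0.2, i.e. 2 of 10 parameters
```

Running the same comparison across seeds and checking how much the sets of changed indices overlap would be one way to probe the "nearly identical across seeds" claim, though the notes do not say how the original study measured it.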

Sources (12 notes)

Does RL training follow predictable scaling curves?

Large-scale study (400K GPU-hours, 200+ models) shows RL performance scales sigmoidally. Recipe choices set the ceiling; implementation details only affect efficiency. Stable recipes enable reliable extrapolation from small runs.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
