INQUIRING LINE

Does the pretrained model prior limit RL search capability more than the optimization algorithm itself?

This explores whether RL's reach is bounded mainly by what the pretrained base model already contains — its latent skills and distribution — rather than by which optimization algorithm you bolt on top.


The corpus leans hard toward the prior as the binding constraint. The clearest statement is that RL post-training teaches a model *when* to reason, not *how*: base models already carry reasoning strategies in latent form, and RL mostly optimizes deployment timing. Hybrid setups recover about 91% of the gains by routing tokens alone, and the activation patterns for reasoning exist before any RL touches the weights (Does RL post-training create reasoning or just deploy it?). If the capability is already sitting in the prior, then the algorithm is steering, not generating.

The mechanics of what RL actually changes reinforce this. RL updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are almost identical across random seeds, which points to structural rather than arbitrary selection (Does reinforcement learning update only a small fraction of parameters?). The dominant mechanism looks like *suppression*: RL works largely by negative reinforcement, damping wrong trajectories rather than installing new ones (What actually changes inside a model during RL training?). A small, suppression-driven edit is a poor candidate for expanding a search space the base model couldn't already enter.
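One way to make the sparsity claim concrete is to diff a base checkpoint against its RL-tuned counterpart and count which weights moved. The sketch below is a toy illustration under stated assumptions: parameters are flattened into plain Python lists, `changed_indices`, `update_fraction`, and `mask_overlap` are hypothetical helper names, and the numbers are invented for the example rather than taken from any study.

```python
"""Toy measurement of RL update sparsity and cross-seed mask overlap.

All names and values here are illustrative assumptions, not results.
"""


def changed_indices(base, tuned, tol=1e-9):
    """Return the set of positions whose weight moved by more than `tol`."""
    return {i for i, (b, t) in enumerate(zip(base, tuned)) if abs(t - b) > tol}


def update_fraction(base, tuned, tol=1e-9):
    """Fraction of parameters that the fine-tuning run actually changed."""
    return len(changed_indices(base, tuned, tol)) / len(base)


def mask_overlap(base, tuned_a, tuned_b, tol=1e-9):
    """Jaccard overlap between the update masks of two runs (1.0 = identical)."""
    a = changed_indices(base, tuned_a, tol)
    b = changed_indices(base, tuned_b, tol)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical flattened weights for a base model and two RL seeds.
base = [0.1 * i for i in range(10)]
seed_a = list(base)
seed_b = list(base)
for i in (2, 5):
    seed_a[i] += 0.05  # seed A moves two weights
for i in (2, 5, 7):
    seed_b[i] -= 0.03  # seed B moves the same two plus one more

print(update_fraction(base, seed_a))
print(update_fraction(base, seed_b))
print(mask_overlap(base, seed_a, seed_b))
```

On real checkpoints the same comparison would run per tensor. Checking whether each update matrix is nearly full-rank would need a linear-algebra library and is left out of this sketch.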

The sharpest evidence that RL narrows rather than widens search comes from diversity studies. RL training collapses behavioral diversity: search agents converge on narrow reward-maximizing strategies through the same entropy-collapse mechanism seen in reasoning, while SFT on diverse demonstrations preserves exploration breadth (Does reinforcement learning squeeze exploration diversity in search agents?). RL also converges on a single dominant *pretraining* format within the first epoch, amplifying one distribution the base model already prefers and suppressing the rest, and which format wins depends on model scale rather than necessarily on performance (Does RL training collapse format diversity in pretrained models?). The prior literally picks the lane. Out-of-distribution probes drive the point home: even GRPO-trained models drop sharply on N-1 variants, suggesting RL sharpens template-matching against the prior rather than installing transferable procedures (Do fine-tuned language models actually learn optimization procedures?).
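Entropy collapse and "the prior picks the lane" can be illustrated with a few lines of arithmetic. The sketch below is a toy, not the actual RL dynamics: it assumes a hypothetical prior over three output formats and uses power sharpening (raise each probability to a power, then renormalize) as a stand-in for reward-driven concentration. The function names and the prior values are assumptions made for illustration.

```python
"""Toy illustration of entropy collapse: sharpening a prior over formats."""
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sharpen(probs, power=2.0):
    """Raise each probability to `power` and renormalize.

    This is an assumed stand-in for reward-driven sharpening, not the
    actual RL update rule.
    """
    raised = [p ** power for p in probs]
    total = sum(raised)
    return [r / total for r in raised]


# Hypothetical prior over three output formats the base model can produce.
prior = [0.40, 0.35, 0.25]

dist = prior
history = [entropy_bits(dist)]
for _ in range(4):
    dist = sharpen(dist)
    history.append(entropy_bits(dist))

# The format that ends up dominant is the one the prior already favored.
winner = max(range(len(dist)), key=dist.__getitem__)
print([round(h, 3) for h in history])
print("winning format index:", winner)
```

In this toy the entropy falls at every step and the winner is whichever format had the largest prior mass, even though the initial gap was small. Nothing in the sharpening step can promote a format the prior gave zero probability.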

But here's the twist worth sitting with: the algorithm isn't innocent. Its failures are about *corrupting* the prior, not failing to exceed it. Overly hard RLVR samples push models into degenerate shortcuts that contaminate pre-existing capabilities, because group-relative normalization treats rare lucky successes as high-advantage signal (Do overly hard RLVR samples actually harm model capabilities?). Binary rewards degrade calibration by rewarding confident guessing, since a confident wrong answer is penalized no more than a hesitant one (Does binary reward training hurt model calibration?). So the optimizer can make a model *worse* than its prior even as it struggles to make it better, an asymmetry that is consistent with the prior being the ceiling.
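Both failure modes come down to short arithmetic. The sketch below assumes the GRPO-style form of group-relative normalization (reward minus group mean, divided by group standard deviation) and an assumed combined reward of correctness minus the squared error between stated confidence and correctness, which is one way to add a Brier term. The function names and the example numbers are hypothetical.

```python
"""Toy arithmetic for two reward pathologies (illustrative assumptions only)."""
import math


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: (reward - group mean) / group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def expected_binary_reward(p_correct, confidence):
    """Binary correctness reward: stated confidence has no effect at all."""
    return p_correct


def expected_brier_reward(p_correct, confidence):
    """Assumed combined reward: correctness minus the Brier penalty."""
    brier = p_correct * (1 - confidence) ** 2 + (1 - p_correct) * confidence ** 2
    return p_correct - brier


# One lucky success in a group of 8 rollouts on a near-impossible problem.
advantages = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])
print(round(advantages[0], 3), round(advantages[1], 3))

# Which stated confidence maximizes expected reward when p(correct) = 0.3?
grid = [i / 100 for i in range(101)]
best_q = max(grid, key=lambda q: expected_brier_reward(0.3, q))
print(best_q)
```

With a single success in a group of G binary rewards, the normalized advantage of that success works out to roughly the square root of G - 1, so the rarer the lucky hit, the harder it is reinforced. On the calibration side, the binary reward is flat in stated confidence, while the assumed Brier-augmented reward peaks where confidence equals the true probability of being correct.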

The most interesting dissent is search-as-exploration. MCTS-based self-improvement uses tree search to surface and rank solution paths the model wouldn't reliably reach greedily, generating dense process-level signal without human labels (Can tree search replace human feedback in LLM training?). Curriculum scheduling that trains structured tasks first can likewise keep open-ended capability alive instead of letting entropy collapse erode it (Does training order reshape how models handle different task types?). The unstated lesson across the corpus: if you want RL to *search* rather than merely *sharpen*, the lever is preserving and reorganizing the prior's diversity, not swapping in a cleverer optimizer.


Sources (10 notes)

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
