Why does the pretrained prior determine the exploration ceiling?
This explores why the abilities baked in during pretraining seem to set a hard limit on how far later reinforcement learning can push a model to discover new behaviors — and whether RL adds capability or just selects from what's already there.
The corpus points to a striking answer: post-training mostly *selects* from a menu the pretrained model already wrote, rather than writing new menu items. Five independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning that was already latent in base-model activations, which suggests the real bottleneck is elicitation, not capability acquisition (see "Do base models already contain hidden reasoning ability?"). If a behavior was never in the prior, no amount of reward-chasing conjures it.
The ceiling becomes visible in how RL narrows things. Within the first epoch, reinforcement learning tends to amplify a single dominant format that already existed in the pretraining distribution while quietly suppressing the alternatives, and which format wins depends on model scale, not necessarily on performance (see "Does RL training collapse format diversity in pretrained models?"). The same compression shows up in search agents, where RL collapses behavioral diversity through the familiar entropy-collapse mechanism, converging on narrow reward-maximizing strategies (see "Does reinforcement learning squeeze exploration diversity in search agents?"). RL is a funnel: it sharpens the prior's strongest mode and discards the rest. That's powerful when the right behavior is already a strong mode, and useless when it isn't.
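The funnel effect can be illustrated with a toy model. The sketch below is not taken from any of the cited experiments: it assumes a softmax policy over three hypothetical output formats and applies the exact expected policy-gradient update, with made-up prior probabilities and rewards. The point it is meant to show is that the size of each update scales with the probability the prior already assigns, so a dominant format can gain mass early even when a rarer format scores higher.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; a rough proxy for format diversity."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def policy_gradient_step(logits, rewards, lr=1.0):
    """One exact gradient-ascent step on expected reward for a softmax policy.

    d E[r] / d logit_i = p_i * (r_i - E[r]), so each logit moves in
    proportion to the probability the policy already gives that format.
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [z + lr * p * (r - expected) for z, p, r in zip(logits, probs, rewards)]

# Hypothetical numbers chosen for illustration only.
prior = [0.70, 0.25, 0.05]     # format A dominates the pretrained prior
rewards = [0.60, 0.50, 0.65]   # rare format C actually scores best

logits = [math.log(p) for p in prior]
for _ in range(50):            # a short training budget
    logits = policy_gradient_step(logits, rewards)
after = softmax(logits)

print("prior:", [round(p, 3) for p in prior], "entropy:", round(entropy(prior), 3))
print("after:", [round(p, 3) for p in after], "entropy:", round(entropy(after), 3))
```

Under these assumed numbers, the dominant format gains probability over the short budget while the higher-reward rare format loses some, and entropy falls. With a long enough run the exact gradient would eventually favor the best format, so this only illustrates early-training dynamics, not a claim about where real RL runs converge.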
The ceiling has a second source: the data a model ever gets to learn from. Agents trained on static expert demonstrations are capped by what the curators imagined, because they never interact with an environment to learn from their own failures (see "Can agents learn beyond what their training data shows?"). So the 'prior' isn't only the base model's weights; it's the horizon of scenarios baked into training. Push beyond that horizon with overly hard problems and the model doesn't reach higher. Instead it learns degenerate shortcuts that even contaminate skills it previously had, because group-relative normalization treats rare accidental successes as high-advantage trajectories (see "Do overly hard RLVR samples actually harm model capabilities?").
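The arithmetic behind that last point can be sketched directly. The source note attributes the effect to group-relative normalization. The snippet below assumes the common GRPO-style form, where each reward is standardized against the mean and standard deviation of its own sampled group. The function name, group size, and binary rewards are illustrative assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (zero mean, unit std).

    Assumed GRPO-style normalization; eps guards against a zero std
    when every rollout in the group gets the same reward.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

# Nearly impossible problem: 1 accidental success in 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Moderately hard problem: 8 successes in 16 rollouts.
medium_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)[0]      # about sqrt(15) ≈ 3.87
medium_adv = group_relative_advantages(medium_group)[0]  # about 1.0

print(f"advantage of the lone success:      {hard_adv:.2f}")
print(f"advantage of a success at 50% rate: {medium_adv:.2f}")
```

With binary rewards and success rate p, a successful rollout's advantage works out to sqrt((1 - p) / p), so the rarer the success, the harder it is reinforced, regardless of whether the reasoning behind it was sound.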
Here's the twist worth knowing: the ceiling is partly about *timing and signal*, not just raw capability. Models commit to choices prematurely because uncertainty signals dominate early transformer layers while the long-horizon 'empowerment' signals that favor exploration only emerge in the middle layers, a temporal mismatch that throttles exploration before it starts (see "Why do large language models explore less effectively than humans?"). And the apparent exploration-vs-exploitation trade-off may itself be a measurement artifact that appears only at the token level and vanishes under hidden-state analysis (see "Is the exploration-exploitation trade-off actually fundamental?"). So part of the 'ceiling' is the prior failing to surface what it already contains, not a true absence of ability.
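The source note for the hidden-state claim describes its analysis as using Effective Rank metrics. As a rough illustration, the sketch below uses one common definition of effective rank: the exponential of the Shannon entropy of the normalized singular values. Whether the cited work uses exactly this variant is an assumption, and the function takes singular values as input rather than computing them from a real hidden-state matrix.

```python
import math

def effective_rank(singular_values):
    """Effective rank as exp(entropy of the normalized singular values).

    Ranges from 1 (all energy in one direction) up to the number of
    non-zero singular values (energy spread evenly across directions).
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    spectrum_entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(spectrum_entropy)

# Hypothetical spectra for two sets of hidden states.
spread_out = [1.0, 1.0, 1.0, 1.0]   # representation uses 4 directions evenly
collapsed = [5.0, 0.0, 0.0, 0.0]    # representation uses a single direction

print(effective_rank(spread_out))
print(effective_rank(collapsed))
```

A measure like this describes how many directions a representation occupies, which is a different quantity from token-level entropy. That is one way a trade-off seen in token statistics might not show up in hidden states.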
Which is exactly why the most promising work moves the action *into* pretraining or stages it carefully. RLP treats chain-of-thought as an exploratory action during pretraining itself, planting reasoning earlier and lifting math and science benchmarks by roughly 19% (see "Can chain-of-thought reasoning be learned during pretraining itself?"). Curricula that run supervised RL first to build a richer prior, then RLVR to sharpen it, beat either method alone because the imitation phase creates the reasonable rollouts RL needs to be informative (see "Does sequencing imitation then exploration training improve reasoning?"). The through-line: if you want a higher exploration ceiling, you raise the prior, because RL alone can only spend what pretraining already deposited.
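The source note describes RLP's reward as verifier-free, based on log-likelihood improvement. A minimal sketch of that idea follows, assuming the reward is the gain in log-probability of the observed next token when the model conditions on a sampled thought versus when it does not. The function name and example probabilities are hypothetical, and how the no-thought baseline is actually computed is not specified in the source.

```python
import math

def log_likelihood_improvement(p_next_with_thought, p_next_without_thought):
    """Reward a sampled thought by how much it raises the log-probability
    of the token that actually comes next in the pretraining text.

    No external verifier is needed: the corpus itself supplies the target.
    """
    return math.log(p_next_with_thought) - math.log(p_next_without_thought)

# Hypothetical probabilities assigned to the true next token.
helpful = log_likelihood_improvement(0.40, 0.10)    # thought made the token likelier
unhelpful = log_likelihood_improvement(0.05, 0.10)  # thought made it less likely

print(f"helpful thought reward:   {helpful:+.3f}")
print(f"unhelpful thought reward: {unhelpful:+.3f}")
```

A reward of this shape is dense and available on ordinary text, which is what lets reasoning be rewarded during pretraining rather than only in a later RL stage with verifiable answers.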
Sources (9 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
SAE decomposition shows uncertainty values dominate early transformer blocks while empowerment representations emerge only in middle blocks. This temporal mismatch causes models to commit to decisions before long-term exploration signals can influence them. Reasoning-trained o1 overcomes this by extending computation time.
Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing that the trade-off emerges only at the token level. VERL demonstrates simultaneous enhancement of both, achieving 21.4% accuracy gains on Gaokao 2024.
RLP treats CoT as an exploratory action during pretraining, using log-likelihood improvement as a verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.