Reinforcement Learning for LLMs

Can pretraining corpora themselves provide verifiable RL rewards?

Does framing next-token prediction as a reasoning task with ground-truth verification eliminate the need for human feedback or domain-specific rewards during language model pretraining?

Note · 2026-02-22 · sourced from RLVR
How should researchers navigate LLM reasoning research? What does reward learning actually do to model reasoning?

Reinforcement Pre-Training (RPT) bridges self-supervised pretraining and reinforcement learning by reframing next-token prediction as next-token reasoning. For any context in a pretraining corpus, the model is incentivized to reason about the subsequent token before predicting it, receiving a verifiable reward based on prediction correctness against the ground-truth next token.
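
To make the reward concrete, here is a minimal sketch of a rule-based next-token reward. It assumes the rollout wraps its final prediction in an `<answer>...</answer>` delimiter and uses exact string match; the delimiter, function name, and matching rule are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a verifiable next-token reward, assuming the rollout ends
# with its final prediction wrapped in an <answer>...</answer> delimiter.
# The delimiter and exact-match criterion are assumptions for exposition.

def rpt_reward(rollout: str, ground_truth_token: str) -> float:
    start = rollout.rfind("<answer>")
    end = rollout.rfind("</answer>")
    if start == -1 or end == -1 or end <= start:
        return 0.0  # malformed rollout: no checkable prediction, no reward
    predicted = rollout[start + len("<answer>"):end].strip()
    # Binary, rule-based reward: did the reasoning chain land on the corpus token?
    return 1.0 if predicted == ground_truth_token else 0.0
```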

This sidesteps the scalability bottleneck of RL for LLMs. Standard RLHF requires costly human preference data. RLVR requires domain-specific verifiable answers. RPT requires nothing beyond the pretraining corpus: the ground-truth next token itself is the verifiable reward. The entire internet becomes RL training data.
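
A small sketch of the "corpus as reward" point: every position in a tokenized document yields an RL example with a verifiable answer. The function name and the minimum-context threshold are assumptions, not the paper's data pipeline.

```python
# Illustrative sketch: each position in a tokenized document becomes a
# (context, ground-truth next token) example for next-token reasoning.
# The min_context threshold is an arbitrary choice for exposition.

from typing import Iterator, List, Tuple

def corpus_to_rl_examples(
    token_ids: List[int], min_context: int = 32
) -> Iterator[Tuple[List[int], int]]:
    for t in range(min_context, len(token_ids)):
        yield token_ids[:t], token_ids[t]  # reasoning prompt, verifiable answer
```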

Three structural advantages emerge. First, the reward signal is rule-based (correct/incorrect next-token prediction), which inherently minimizes reward hacking — there is no learned reward model to exploit. Second, by encouraging reasoning patterns before each prediction, RPT promotes deeper understanding rather than surface memorization of token sequences. Third, the internal reasoning process allocates more computational effort per prediction step — a form of inference-time scaling applied at training time.
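
A rough sketch of the third point, showing where the extra per-prediction compute goes: for each context the policy samples several reasoning rollouts, each is scored by the rule-based reward sketched above, and the update reinforces rollouts relative to the group. The `policy` interface, group size, and group-relative (GRPO-style) advantage are all assumptions; the note does not specify the RL algorithm.

```python
# Hypothetical single-context training step. The policy object (with .generate
# returning rollout text and .update applying a policy-gradient step), the group
# size, and the group-relative advantage are illustrative assumptions.

import statistics

def rpt_step(policy, context_ids, ground_truth_token, group_size: int = 8) -> float:
    # Several full reasoning rollouts for a single next-token prediction:
    # the training-time analogue of inference-time scaling.
    rollouts = [policy.generate(context_ids) for _ in range(group_size)]
    rewards = [rpt_reward(r, ground_truth_token) for r in rollouts]

    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    advantages = [(r - mean) / std for r in rewards]  # group-relative baseline

    policy.update(rollouts, advantages)  # reinforce rollouts that hit the corpus token
    return mean  # fraction of rollouts whose final prediction was correct
```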

Like Quiet-STaR (see: Can models learn reasoning from predicting text alone?), RPT operates at the same token-level granularity but with a fundamentally different mechanism. Quiet-STaR learns to generate useful rationales between tokens via a reinforcement signal; RPT learns to reason about what comes next via next-token verification. Both suggest that token-level reasoning during pretraining is a viable path to general reasoning capability.

The scaling curves show consistent improvement with increased training compute — more RPT training means better next-token prediction accuracy. RPT also provides a strong foundation for subsequent reinforcement fine-tuning, suggesting the reasoning patterns learned during pretraining compose with downstream RL rather than conflicting with it.


Source: RLVR

reinforcement pre-training reframes next-token prediction as a reasoning task trained with rl — using the corpus itself as verifiable reward