Can pretraining corpora themselves provide verifiable RL rewards?
Does framing next-token prediction as a reasoning task with ground-truth verification eliminate the need for human feedback or domain-specific rewards during language model pretraining?
Reinforcement Pre-Training (RPT) bridges self-supervised pretraining and reinforcement learning by reframing next-token prediction as next-token reasoning. For any context in a pretraining corpus, the model is incentivized to reason about the subsequent token before predicting it, receiving a verifiable reward based on prediction correctness against the ground-truth next token.
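As a rough illustration, here is a minimal Python sketch of that reward rule, assuming a simple exact-match criterion and a hypothetical `sample_reasoning` rollout call; the actual RPT recipe may score correctness differently (for example, by prefix matching at token boundaries).

```python
# Minimal sketch of the next-token-reasoning reward, under an assumed
# exact-match rule. `sample_reasoning` stands in for whatever rollout
# call the training stack exposes; it is not a real API.
from dataclasses import dataclass


@dataclass
class Rollout:
    reasoning: str   # free-form chain of thought about the next token
    prediction: str  # the token the model commits to after reasoning


def next_token_reward(rollout: Rollout, ground_truth_token: str) -> float:
    """Verifiable, rule-based reward: 1.0 iff the predicted token matches
    the corpus ground truth, 0.0 otherwise. No learned reward model."""
    return 1.0 if rollout.prediction == ground_truth_token else 0.0


def rpt_example(model, context_tokens: list[str]) -> tuple[Rollout, float]:
    """One RPT training example: reason about the final position, then get
    scored against the token the corpus actually contains there."""
    prefix, ground_truth = context_tokens[:-1], context_tokens[-1]
    rollout = model.sample_reasoning(prefix)  # hypothetical rollout call
    return rollout, next_token_reward(rollout, ground_truth)
```

The essential property is that correctness is checked directly against the corpus, so the reward requires no human labels and no learned reward model.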
This corpus-as-reward setup addresses the core scalability bottleneck of RL for LLMs. Standard RLHF requires costly human preferences; RLVR requires domain-specific verifiable answers. RPT requires nothing beyond the pretraining corpus itself: the ground-truth next token is the verifiable reward. The entire internet becomes RL training data.
Three structural advantages emerge. First, the reward signal is rule-based (correct/incorrect next-token prediction), which inherently minimizes reward hacking — there is no learned reward model to exploit. Second, by encouraging reasoning patterns before each prediction, RPT promotes deeper understanding rather than surface memorization of token sequences. Third, the internal reasoning process allocates more computational effort per prediction step — a form of inference-time scaling applied at training time.
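To make the "no reward model to exploit" point concrete, here is a hedged sketch of how those 0/1 rewards could drive a group-relative policy update (GRPO-style) over several reasoning rollouts sampled from the same context; the specific RL algorithm and group size are assumptions, not details stated in this note.

```python
# Hedged sketch: turning rule-based 0/1 rewards into per-rollout advantages
# without a critic or learned reward model. The GRPO-style normalization and
# the group size of 8 are illustrative assumptions.
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Center and scale rewards within a group of rollouts sampled from the
    same context; correct predictions get positive advantages, incorrect
    ones negative, purely from the verifiable reward."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


# Example: 8 reasoning rollouts for one context, 3 of which predicted the
# correct next token.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
print(group_relative_advantages(rewards))
```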
Relative to Quiet-STaR (the subject of "Can models learn reasoning from predicting text alone?"), RPT operates at the same token-level granularity but with a fundamentally different mechanism. Quiet-STaR learns to generate useful rationales between tokens via a reinforcement signal. RPT learns to reason about what comes next via next-token verification. Both suggest that token-level reasoning during pretraining is a viable path to general reasoning capability.
The scaling curves show consistent improvement with increased training compute — more RPT training means better next-token prediction accuracy. RPT also provides a strong foundation for subsequent reinforcement fine-tuning, suggesting the reasoning patterns learned during pretraining compose with downstream RL rather than conflicting with it.
Source: RLVR
Related concepts in this collection
- Can models learn reasoning from predicting text alone?
  Can language models bootstrap general reasoning abilities by generating explanations at every token position during pretraining, without task-specific supervision? This explores whether reasoning emerges naturally from optimizing predictive accuracy.
  Relation: parallel token-level reasoning integration during pretraining.
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pretraining as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  Relation: RPT may create stronger latent capabilities than standard pretraining.
- Can chain-of-thought reasoning emerge during pretraining itself?
  Does treating reasoning as an exploratory action within the pretraining phase, rather than post-training, allow models to develop stronger reasoning capabilities earlier? This matters because it could reshape when and how we train reasoning into language models.
  Relation: RPT is the RL-native version of this bridge.
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  Relation: RPT strengthens what RL post-training later activates. If pretraining embeds RL-trained reasoning patterns, the latent capability that post-training teaches "when" to deploy is richer than standard pretraining would produce.
Original note title
reinforcement pre-training reframes next-token prediction as a reasoning task trained with rl — using the corpus itself as verifiable reward