Reinforcement Pre-Training

Paper · arXiv 2506.08007 · Published June 9, 2025
RLVR · Reinforcement Learning · Reward Models · Novel Architectures · Reasoning Architectures

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained with RL, in which the model receives verifiable rewards for correctly predicting the next token of a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves next-token prediction accuracy. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves next-token prediction accuracy. These results position RPT as an effective and promising scaling paradigm for advancing language model pre-training.

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, largely driven by the scalability of the next-token prediction objective on vast text corpora. This self-supervised paradigm has proven to be an effective general-purpose pre-training approach. Concurrently, reinforcement learning (RL) has emerged as a powerful technique for fine-tuning LLMs, aligning them with human preferences or enhancing specific skills such as complex reasoning [OWJ+22, JKL+24, GYZ+25].

However, current applications of RL in LLM training face scalability and generality challenges. Reinforcement learning from human feedback [OWJ+22], while effective for alignment, relies on costly human preference data, and its learned reward models can be susceptible to reward hacking, limiting scalability. Alternatively, reinforcement learning with verifiable rewards (RLVR) [LMP+25] utilizes objective, rule-based rewards, often from question-answer pairs. While this mitigates reward hacking, RLVR is typically constrained by the scarcity of annotated data with verifiable answers, restricting its application to domain-specific fine-tuning rather than general-purpose pre-training.

In this work, we introduce reinforcement pre-training (RPT), a novel paradigm that bridges the gap between scalable self-supervised pre-training and the power of reinforcement learning. RPT reframes the fundamental next-token prediction task as a next-token reasoning process. For any given context in a pre-training corpus, the model is incentivized to reason about the subsequent token before predicting it. It receives a verifiable, intrinsic reward based on the correctness of its prediction against the ground-truth next token from the corpus itself. This approach transforms the vast, unannotated text data typically used for next-token prediction into a massive dataset for general-purpose RL, without requiring external annotations or domain-specific reward functions.
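To make this reward concrete, the sketch below shows one way such a verifiable next-token reward could be computed from a model rollout. The helper names (`extract_final_prediction`, `rpt_reward`), the `Answer:` formatting convention, and the exact-match rule are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a verifiable next-token reward, assuming the model's
# reasoning trace ends with a line of the form "Answer: <token>".

def extract_final_prediction(completion: str) -> str:
    """Pull the model's final predicted token out of its reasoning trace."""
    for line in reversed(completion.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return ""

def rpt_reward(completion: str, ground_truth_token: str) -> float:
    """Rule-based verifiable reward: 1.0 if the predicted token matches the
    ground-truth next token taken from the corpus, else 0.0."""
    return 1.0 if extract_final_prediction(completion) == ground_truth_token else 0.0

# Example: rewarding a rollout for the context "The capital of France is"
rollout = "The next word should name a city, likely the French capital.\nAnswer: Paris"
print(rpt_reward(rollout, "Paris"))  # -> 1.0
```

Because the reward is derived directly from the corpus itself, every position in ordinary pre-training text can serve as an RL example under this scheme, with no external annotation required.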

This approach offers several crucial advantages. First, RPT is inherently scalable and general-purpose: it leverages the same vast, unannotated text data used for standard next-token prediction, transforming it into a massive dataset for general-purpose RL without requiring external annotations. Second, the use of direct, rule-based reward signals (i.e., the correctness of the predicted next token) inherently minimizes the risk of reward hacking often associated with complex, learned reward models. Third, by explicitly encouraging next-token reasoning patterns, RPT promotes deeper understanding and generalization instead of merely memorizing next tokens. The model learns to explore and validate hypotheses about why a certain token should follow, fostering more robust representations. Finally, the internal reasoning process during pre-training effectively allows the model to allocate more “thought” or computational effort to each prediction step, akin to a form of inference-time scaling applied at training time for each token, which directly contributes to improved next-token prediction accuracy.