Thinking Augmented Pre-training
This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking Augmented Pre-training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to 100B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of 3. For a 3B parameter model, it improves the post-training performance by over 10% on several challenging reasoning benchmarks.
Modern data engineering pipelines (Penedo et al., 2024; AI et al., 2025; Li et al., 2024) for large-scale pre-training are multifaceted processes. They often employ techniques such as parsing, deduplication, filtering, domain balancing, rewriting (Maini et al., 2024), and synthetic data generation (Gunasekar et al., 2023) to enrich the quality and diversity of the resulting training corpus.
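To make two of these curation steps concrete, the following is a minimal, hypothetical Python sketch of exact deduplication and heuristic quality filtering; the function names and thresholds are illustrative and do not reproduce the pipelines of the cited works.

```python
import hashlib

def exact_dedup(documents):
    """Drop documents whose normalized text has already been seen."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def quality_filter(documents, min_words=50, max_symbol_ratio=0.1):
    """Keep documents that pass simple heuristic quality checks."""
    kept = []
    for doc in documents:
        n_symbols = sum(not c.isalnum() and not c.isspace() for c in doc)
        symbol_ratio = n_symbols / max(len(doc), 1)
        if len(doc.split()) >= min_words and symbol_ratio <= max_symbol_ratio:
            kept.append(doc)
    return kept

# Example composition: deduplicate first, then filter.
# corpus = quality_filter(exact_dedup(raw_documents))
```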
Orthogonal to the development of enhanced data curation pipelines, a critical but underexplored dimension is the maximization of utility from existing data. Prior research addresses this challenge through a data selection lens (Lin et al., 2024; Mindermann et al., 2022), proposing to train models exclusively on a subset of valuable tokens that are learnable but not yet learned. However, some valuable tokens can be exceptionally difficult to learn in a single next-token prediction step, as they often represent the outputs of intricate, multi-step human reasoning processes (Xiang et al., 2025). Figure 1 provides an illustrative example where the correct answer token “890” is derived from a sequence of reasoning steps that necessitate an understanding of polynomial division, the Remainder Theorem, and the properties of divisors. When a model’s capacity is limited, it may be unable to learn such tokens other than by pure memorization, which generalizes poorly.
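To illustrate the data selection view (which is distinct from TPT itself), below is a minimal sketch in the spirit of selective language modeling (Lin et al., 2024): each token is scored by how much its loss under the current model exceeds its loss under a reference model, and only the highest-scoring, "learnable but not yet learned" tokens contribute to the training loss. The keep ratio and masking scheme here are assumptions for illustration.

```python
import torch

def learnable_token_mask(train_loss, ref_loss, keep_ratio=0.6):
    """
    train_loss, ref_loss: per-token cross-entropy losses, shape (batch, seq_len).
    Tokens with large excess loss (high under the current model, low under a
    strong reference model) are treated as learnable but not yet learned.
    Returns a 0/1 mask selecting the top `keep_ratio` fraction of tokens.
    """
    excess = train_loss - ref_loss
    k = max(1, int(keep_ratio * excess.numel()))
    threshold = excess.flatten().topk(k).values.min()
    return (excess >= threshold).float()

# Only the selected tokens contribute to the training objective.
train_loss = torch.rand(2, 8)
ref_loss = torch.rand(2, 8)
mask = learnable_token_mask(train_loss, ref_loss)
selected_loss = (mask * train_loss).sum() / mask.sum()
```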
To circumvent these limitations, we introduce a thinking augmented training approach called TPT that automatically expands pre-training datasets and enhances their learnability for LLMs. Our method augments the raw data by generating thinking trajectories using readily available open-source LLMs. These trajectories simulate an expert’s in-depth thought process as they analyze the given text, mirroring the way humans learn new knowledge. Given that explanation is often easier than generation from scratch, models trained on such augmented data can, as our experiments demonstrate, surpass the performance of the thinking generation model itself. TPT is highly scalable as it requires no human annotation and imposes no constraints on document structure.
Thinking pattern analysis reveals that our method naturally up-samples high-quality data, aligning with contemporary data engineering practices that have been empirically validated as effective. For example, thinking trajectories tend to be longer in domains such as mathematics, and the length of a trajectory correlates positively with the reasoning intensity and difficulty of the original text. A longer thinking trajectory implies that more training compute is allocated to the corresponding tokens. This bears a resemblance to test-time scaling (Jaech et al., 2024), where more difficult samples benefit from increased inference compute. The key distinction is that we apply this principle during training, allocating more training compute to challenging samples, which in turn enhances their learnability for models.
Prompt template for generating thinking trajectories:

{{CONTEXT}}
End of the context
Simulate an expert’s in-depth thought process as they analyze the above context, focusing on complex and informative aspects. Skip trivial details. Use Feynman technique whenever possible to ensure a deep understanding.
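A minimal sketch of the augmentation step, assuming a generic text-in, text-out callable for the open-source thinking generator; the prompt is the template above, and the `<think>` delimiters used to join the trajectory to the document are an illustrative formatting choice rather than a detail specified here.

```python
PROMPT_TEMPLATE = (
    "{context}\n"
    "End of the context\n"
    "Simulate an expert's in-depth thought process as they analyze the above "
    "context, focusing on complex and informative aspects. Skip trivial "
    "details. Use Feynman technique whenever possible to ensure a deep "
    "understanding."
)

def augment_with_thinking(document: str, generate) -> str:
    """
    Build the thinking-generation prompt for one document, obtain a thinking
    trajectory from an off-the-shelf LLM (`generate` is any function mapping a
    prompt string to generated text), and append the trajectory to the raw
    document. The concatenated sequence is then used as ordinary
    next-token-prediction training data.
    """
    prompt = PROMPT_TEMPLATE.format(context=document)
    thinking = generate(prompt)
    return document + "\n<think>\n" + thinking + "\n</think>"

# Example with a placeholder generator; in practice `generate` would wrap an
# open-source reasoning model served through any inference stack.
augmented = augment_with_thinking("Sample passage ...", lambda p: "Step-by-step analysis ...")
```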
• Dynamic Allocation of Training Compute. Valuable tokens can be difficult to learn in a generalizable manner by training on them directly, as exemplified in Figure 1. Thinking augmentation breaks down complex tokens into smaller, more explainable steps, thereby effectively allocating more training compute to them. This is analogous to test-time scaling, but applied during training instead of inference. Empirical evidence in Section 4 shows that thinking trajectories tend to be longer for high-value domains and documents, which functions as a natural up-sampling mechanism.
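As a back-of-the-envelope illustration (our own, not a formula from the paper): under next-token prediction, the compute spent on a document grows roughly in proportion to its token count, so appending a thinking trajectory scales that document's share of training compute as sketched below.

```python
def upsampling_factor(doc_tokens: int, thinking_tokens: int) -> float:
    """Approximate factor by which thinking augmentation scales the training
    compute spent on a document, assuming compute is proportional to sequence
    length under next-token prediction."""
    return (doc_tokens + thinking_tokens) / doc_tokens

# Hypothetical numbers: a reasoning-heavy document with a long trajectory is
# effectively up-sampled far more than an easy one with a short trajectory.
print(upsampling_factor(500, 2000))  # 5.0
print(upsampling_factor(500, 250))   # 1.5
```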