
Can training data itself teach harder reasoning steps?

Can augmenting pretraining data with generated reasoning trajectories help models learn complex multi-step reasoning more efficiently? This note explores whether intermediate explanations in the training data unlock capabilities that standard next-token prediction misses.

Note · 2026-02-22 · sourced from LLM Architecture

Thinking-augmented Pre-Training (TPT, arXiv:2509.20186) starts from a simple insight: some valuable tokens are too hard to learn in a single next-token prediction step because they represent the output of complex multi-step human reasoning. Rather than modifying the architecture, TPT augments the training data itself: it generates thinking trajectories with open-source LLMs and interleaves them with the original text.
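
A rough sketch of what the augmentation loop might look like; the model choice, prompt wording, and <think> delimiters below are illustrative assumptions, not the paper's exact recipe:

```python
# Sketch of a TPT-style augmentation pass over a pretraining corpus.
# Any capable open-source LLM can play the "thinker"; the model name,
# prompt, and <think> tags here are assumptions for illustration.
from transformers import pipeline

thinker = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

PROMPT = (
    "Read the following text and reason step by step about how an expert "
    "would arrive at its key claims and results.\n\n{doc}\n\nThinking:"
)

def augment(doc: str, max_new_tokens: int = 1024) -> str:
    """Append a generated thinking trajectory to the document, exposing
    the intermediate steps behind its hard-to-predict tokens."""
    thinking = thinker(
        PROMPT.format(doc=doc),
        max_new_tokens=max_new_tokens,
        return_full_text=False,
    )[0]["generated_text"]
    return f"{doc}\n<think>\n{thinking.strip()}\n</think>"

# The augmented documents then go through ordinary next-token-prediction
# pretraining; neither the model nor the objective changes.
```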

The key finding: 3x improvement in data efficiency, with 10%+ gains on reasoning benchmarks for a 3B model. No architecture changes. No human annotation. The thinking trajectories simulate an expert's analysis of the text, decomposing hard tokens into learnable intermediate steps.

The mechanism has a natural self-organizing property: thinking trajectories come out longer for reasoning-intensive domains such as mathematics, and trajectory length correlates positively with the reasoning intensity of the original text. Harder tokens therefore automatically receive more training compute through longer trajectories, functioning as a natural up-sampling mechanism for high-value data.
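
A toy calculation makes the up-sampling effect concrete (the token counts below are invented for illustration): since the pretraining loss is summed over tokens, a document with a longer thinking trajectory simply contributes more training signal.

```python
# Invented token counts: a reasoning-heavy document draws a long
# trajectory, a shallow one a short trajectory. Because NTP loss is
# summed per token, total tokens act as an implicit sampling weight.
docs = {
    "math_proof":   {"text": 800, "thinking": 2400},
    "news_article": {"text": 800, "thinking": 300},
}
for name, d in docs.items():
    total = d["text"] + d["thinking"]
    weight = total / d["text"]
    print(f"{name}: {total} training tokens -> {weight:.1f}x effective weight")
# math_proof: 3200 training tokens -> 4.0x effective weight
# news_article: 1100 training tokens -> 1.4x effective weight
```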

This is the training-time analog of test-time scaling. Where "Can inference compute replace scaling up model size?" makes the case at inference, TPT shows the same principle operating during training: allocate more compute to harder tokens. The difference is the intervention point, training rather than inference.

The connection to "Can pretraining corpora themselves provide verifiable RL rewards?" (RPT) is complementary: RPT changes the training objective (RL instead of NTP), while TPT changes the training data (augmented with thinking). Both target the same problem, that standard NTP is insufficient for learning complex reasoning from data, but they intervene at different levels.
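
A schematic contrast, using toy tensors (all shapes and values below are placeholders, not either paper's implementation): TPT keeps the ordinary cross-entropy objective and changes only what the target sequences contain, while RPT swaps the objective itself for a verifiable reward trained with RL.

```python
import torch
import torch.nn.functional as F

vocab = 50_000
logits = torch.randn(2, 16, vocab)          # toy model outputs
targets = torch.randint(0, vocab, (2, 16))  # toy token ids

# TPT: the objective is unchanged; only the data changes. Targets here
# would be original-text tokens interleaved with thinking tokens.
tpt_loss = F.cross_entropy(logits.transpose(1, 2), targets)

# RPT (schematic): the objective changes. A sampled continuation earns a
# verifiable reward when it matches the corpus next token, and the model
# is updated with an RL method (e.g. policy gradient) rather than NTP.
sampled = torch.distributions.Categorical(logits=logits).sample()
reward = (sampled == targets).float()       # 1 where prediction verifies
```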

If, as "Do base models already contain hidden reasoning ability?" suggests, reasoning capabilities lie latent in base models, TPT provides a pretraining-time mechanism for strengthening them. The thinking trajectories may serve as the training-time equivalent of the "minimal signals" that activate reasoning, making reasoning patterns more available for later post-training to refine.

A notable finding: the model trained on augmented data can surpass the performance of the LLM that generated the thinking trajectories. Explanation is easier than generation from scratch, so the student benefits from the teacher's explanatory labor even when the teacher's own generation capabilities are limited.


Source: LLM Architecture

Original note title: thinking-augmented pre-training increases data efficiency 3x by applying test-time scaling principles at training time