Can chain-of-thought reasoning emerge during pretraining itself?
Does treating reasoning as an exploratory action within pretraining, rather than in post-training, allow models to develop stronger reasoning capabilities earlier? This matters because it could reshape when and how we train reasoning into language models.
The dominant paradigm separates pretraining (next-token prediction) from reasoning (RL post-training with verifiable rewards). RLP challenges this by bringing RL's core mechanism — exploration — into pretraining itself. The key idea: treat chain-of-thought as an exploratory action taken before predicting each next token, with reward computed from the information gain that thought provides.
The reward signal is elegant: measure the increase in log-likelihood of the observed token when conditioning on both context and a sampled reasoning chain, compared to context alone. This is verifier-free (no task-specific checkers needed), dense (assigns credit at every position), and applicable to ordinary web-scale text during pretraining. The model learns to think for itself before predicting what comes next, teaching independent thinking behavior earlier in training.
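A minimal sketch of what this per-token information-gain reward could look like. The function names, the toy likelihoods, and the use of the same model without a thought as the baseline are illustrative assumptions, not the paper's exact implementation:

```python
# Hedged sketch of RLP's per-token information-gain reward.
# `logprob_next_token` is a hypothetical stand-in for scoring with the policy model.
import math

def logprob_next_token(context: str, token: str, thought: str | None = None) -> float:
    """Placeholder for log p(token | context[, thought]) under a language model.

    A real implementation would score `token` with the policy, optionally
    prepending the sampled chain-of-thought `thought` to the context.
    """
    # Toy stand-in: pretend the thought makes the observed token more likely.
    base = math.log(0.10)
    bonus = math.log(1.8) if thought else 0.0
    return base + bonus

def information_gain_reward(context: str, token: str, thought: str) -> float:
    """Reward = log p(x_t | context, thought) - log p(x_t | context).

    Positive when the sampled reasoning chain makes the observed next token
    more likely; dense because it exists at every position, and verifier-free
    because it only needs the model's own likelihoods on ordinary text.
    """
    with_thought = logprob_next_token(context, token, thought)
    without_thought = logprob_next_token(context, token, thought=None)
    return with_thought - without_thought

if __name__ == "__main__":
    r = information_gain_reward(
        context="The derivative of x^2 is",
        token=" 2x",
        thought="Apply the power rule: d/dx x^n = n x^(n-1).",
    )
    print(f"information-gain reward: {r:.3f}")
```

In this framing, the sampled thought is the exploratory action and the likelihood gap is its dense reward, which is what lets the signal apply to every token of web-scale text without task-specific checkers.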
The results are substantial: pretraining Qwen3-1.7B with RLP lifts the average across eight math-and-science benchmarks by 19%, and with identical post-training the gains compound rather than wash out. Applied to Nemotron-Nano-12B, the overall average increases from 42.81% to 61.32%. The largest improvements are on reasoning-heavy tasks like AIME25 and MMLU-Pro.
This is significant because it reframes when reasoning should be learned. If base models already contain hidden reasoning ability (see "Do base models already contain hidden reasoning ability?"), RLP suggests that pretraining itself can plant stronger reasoning seeds. And if RL mainly teaches models when to deploy reasoning (see "Does RL teach reasoning or just when to use it?"), RLP may teach the "how" during pretraining, leaving post-training to teach the "when": a cleaner division of labor.
Unlike prior reinforcement pretraining (RPT), which uses sparse binary rewards and relies on proxy-model filtering, RLP provides continuous improvement signals at every position and trains on full documents, eliminating the need to preselect high-entropy tokens.
Source: Reinforcement Learning
Related concepts in this collection
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  extends: RLP strengthens the latent reasoning during pretraining itself
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  complements: RLP may teach "how" during pretraining, leaving post-training for "when"
- Can models learn reasoning from predicting text alone?
  Can language models bootstrap general reasoning abilities by generating explanations at every token position during pretraining, without task-specific supervision? This explores whether reasoning emerges naturally from optimizing predictive accuracy.
  parallels: both generate internal rationales at token level with self-supervised reward, but RLP operates during pretraining
- Can adversarial training replace task-specific verifiers for reasoning?
  Does an adversarial game between policy and critic provide sufficient reward signal for reasoning tasks when ground-truth verifiers don't exist? This matters because most reasoning domains lack verifiers but have abundant expert demonstrations.
  connects: both achieve verifier-free reasoning training but via different mechanisms
Original note title
chain-of-thought as pretraining exploratory action with information-gain reward bridges next-token prediction and reasoning emergence