Reinforcement Learning for LLMs

Can chain-of-thought reasoning emerge during pretraining itself?

Does treating reasoning as an exploratory action during pretraining, rather than deferring it to post-training, allow models to develop stronger reasoning capabilities earlier? This matters because it could reshape when and how we train reasoning into language models.

Note · 2026-02-22 · sourced from Reinforcement Learning

The dominant paradigm separates pretraining (next-token prediction) from reasoning (RL post-training with verifiable rewards). RLP challenges this by bringing RL's core mechanism — exploration — into pretraining itself. The key idea: treat chain-of-thought as an exploratory action taken before predicting each next token, with reward computed from the information gain that thought provides.
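
Written out, the mechanism that paragraph describes looks roughly like the following. The notation (c_t for the context, z_t for the sampled thought, and a "base" no-think scorer) is assumed for illustration, not quoted from the paper:

```latex
% Sketch of the per-token setup (notation assumed, not quoted).
% c_t = x_{<t} is the context and z_t a sampled chain-of-thought.
% Whether the no-think baseline is the current policy or a frozen
% (e.g. EMA) copy is an implementation detail left open here.
z_t \sim \pi_\theta(\cdot \mid c_t), \qquad
r_t = \log p_\theta(x_t \mid c_t, z_t) - \log p_{\mathrm{base}}(x_t \mid c_t)
```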

The reward signal is elegant: measure the increase in log-likelihood of the observed token when conditioning on both context and a sampled reasoning chain, compared to conditioning on context alone. This is verifier-free (no task-specific checkers needed), dense (it assigns credit at every position), and applicable to ordinary web-scale text during pretraining. The model learns to think before predicting what comes next, internalizing that behavior much earlier in training.
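
A minimal sketch of that reward, assuming a generic autoregressive scorer; the function names (`lm_logprob`, `information_gain_reward`) and the toy model are mine, not the paper's API:

```python
import torch

def lm_logprob(model, input_ids: torch.Tensor, target_id: int) -> torch.Tensor:
    """Log-probability the model assigns to target_id after input_ids."""
    logits = model(input_ids.unsqueeze(0))           # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[0, -1], -1)  # next-token distribution
    return logprobs[target_id]

def information_gain_reward(model, context, thought, next_token):
    """Dense, verifier-free reward: how much a sampled thought raises the
    log-likelihood of the observed next token versus context alone."""
    with_thought = lm_logprob(model, torch.cat([context, thought]), next_token)
    without_thought = lm_logprob(model, context, next_token)
    return with_thought - without_thought  # positive iff the thought helped

# Toy demo with a random "LM" so the sketch runs end to end.
if __name__ == "__main__":
    vocab, dim = 100, 32
    toy_lm = torch.nn.Sequential(
        torch.nn.Embedding(vocab, dim),
        torch.nn.Linear(dim, vocab),
    )
    context = torch.randint(vocab, (8,))  # observed prefix c_t
    thought = torch.randint(vocab, (5,))  # sampled chain-of-thought z_t
    with torch.no_grad():
        print(information_gain_reward(toy_lm, context, thought, next_token=3))
```

In an actual training loop this scalar would serve as the advantage for a policy-gradient update on the sampled thought; that step, and the choice of baseline model, are deliberately left out of the sketch.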

The results are substantial: pretraining Qwen3-1.7B with RLP lifts the average across eight math-and-science benchmarks by 19%, and with identical post-training the gains compound further. Applied to Nemotron-Nano-12B, the overall average rises from 42.81% to 61.32%. The largest improvements appear on reasoning-heavy tasks like AIME25 and MMLU-Pro.

This is significant because it reframes when reasoning should be learned. If, as "Do base models already contain hidden reasoning ability?" asks, that ability is already latent, then RLP suggests pretraining itself can plant stronger reasoning seeds. And if, per "Does RL teach reasoning or just when to use it?", RL post-training mainly teaches when to reason, RLP may teach the "how" during pretraining and leave post-training to teach the "when": a cleaner division of labor.

Unlike prior reinforcement pretraining (RPT), which uses sparse binary rewards and relies on proxy-model filtering, RLP provides a continuous improvement signal at every position and trains on full documents, eliminating the need to preselect high-entropy tokens.
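
To make the contrast concrete, here is a hypothetical side-by-side of the two reward shapes; both function names are mine, not RPT's or RLP's:

```python
def rpt_style_reward(sampled_token: int, observed_token: int) -> float:
    # Sparse and binary: credit only on an exact match, so most positions
    # yield zero signal and tokens are often pre-filtered by entropy.
    return 1.0 if sampled_token == observed_token else 0.0

def rlp_style_reward(logp_with_thought: float, logp_context_only: float) -> float:
    # Dense and continuous: a graded log-likelihood ratio at every position,
    # usable on full documents with no token preselection.
    return logp_with_thought - logp_context_only
```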


Source: Reinforcement Learning


Chain-of-thought as a pretraining exploratory action with an information-gain reward bridges next-token prediction and reasoning emergence.