What makes token-level reasoning during pretraining different from test-time chain-of-thought?
This explores the difference between reasoning that gets baked into a model's weights while it's still learning to predict text (pretraining), versus the step-by-step 'thinking out loud' a finished model produces when you ask it a question (test-time chain-of-thought).
The cleanest way to see the split: pretraining-time reasoning changes what the model *is*, while test-time chain-of-thought is something the model *does* on demand. Two threads in the corpus make the pretraining case concrete. Reinforcement Pre-Training (RPT) reframes ordinary next-token prediction as a reasoning task and draws a verifiable reward straight from the corpus itself: the predicted token either matches the text or it doesn't, which leaves little for the model to game [Can next-token prediction become a reasoning task with RL?]. RLP pushes the same idea with a verifier-free twist: it treats each chain-of-thought as an exploratory action and rewards it by how much it improves the model's log-likelihood of what comes next, lifting downstream reasoning on math and science benchmarks by a reported ~19% [Can chain-of-thought reasoning be learned during pretraining itself?]. In both cases the signal is intrinsic to text, available at massive scale, and shapes the weights before any fine-tuning happens.
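The two pretraining rewards described above can be sketched in a few lines. This is a minimal illustration, not either paper's implementation: the function names `rpt_reward` and `rlp_reward` are hypothetical, the probabilities are passed in as plain numbers rather than computed by a model, and the choice of what the "without thought" baseline is has been left out as an assumption.

```python
import math


def rpt_reward(predicted_token: str, actual_token: str) -> float:
    """Match-based reward in the spirit of Reinforcement Pre-Training.

    The corpus itself is the verifier: the reward is 1.0 when the token the
    model predicted after reasoning equals the real next token, else 0.0.
    """
    return 1.0 if predicted_token == actual_token else 0.0


def rlp_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Verifier-free reward in the spirit of RLP (simplified assumption).

    The reward is the gain in log-likelihood of the true next token when the
    model conditions on its own chain-of-thought, relative to predicting
    without it. Positive means the thought helped; negative means it hurt.
    """
    return math.log(p_with_thought) - math.log(p_without_thought)


# Hypothetical numbers, for illustration only.
match = rpt_reward("Paris", "Paris")   # 1.0
miss = rpt_reward("Lyon", "Paris")     # 0.0
helpful = rlp_reward(0.50, 0.25)       # log(2), about 0.693
harmful = rlp_reward(0.10, 0.20)       # negative: the thought made things worse
```

The design difference is visible in the return types: the match-based reward is binary and needs an exact target, while the log-likelihood reward is dense and graded, so a thought can earn partial credit for merely making the right continuation more probable.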
Test-time chain-of-thought looks very different under scrutiny. A cluster of notes argues it is largely *imitation of reasoning form* rather than fresh inference: models reproduce familiar reasoning schemata learned in training, which is why structurally invalid prompts still 'work' and why performance degrades predictably the moment the task, length, or format shifts away from the training distribution [Does chain-of-thought reasoning reveal genuine inference or pattern matching?] [Does chain-of-thought reasoning actually generalize beyond training data?] [What makes chain-of-thought reasoning actually work?]. The most striking evidence that test-time traces aren't carrying genuine logic: deliberately corrupted, semantically irrelevant traces teach about as well as correct ones, which suggests the visible steps act as computational scaffolding rather than meaningful thought [Do reasoning traces need to be semantically correct?].
Here's the connective tissue you might not expect: a strong line of work claims the reasoning capability already lives in the base model, and post-training merely *selects* it. Five independent methods (RL steering, critique tuning, decoding changes, feature steering, RLVR) all elicit reasoning that was latent in base-model activations [Do base models already contain hidden reasoning ability?]. If that's right, then pretraining-time methods and test-time chain-of-thought are doing related jobs at different stages: pretraining plants and strengthens the latent capability, while test-time generation is one (lossy, imitation-prone) way of unlocking it. That reframes the comparison from 'two kinds of reasoning' to 'where in the pipeline the reasoning gets put in versus pulled out.'
The token-level view sharpens the contrast further. Within a generated reasoning trace, only a sparse minority of tokens actually matter: roughly 20% are high-entropy 'forking points' that carry the RLVR learning signal [Do high-entropy tokens drive reasoning model improvements?], and specific reflection tokens like 'Wait' and 'Therefore' spike in mutual information with the correct answer [Do reflection tokens carry more information about correct answers?]. But generating more of these visible steps isn't free: accuracy peaks and then *declines* past a thinking-token threshold (87.3% down to 70.3% as thinking tokens grew from ~1,100 to ~16K), and local token-to-token memorization drives up to 67% of reasoning errors [Does more thinking time always improve reasoning accuracy?] [Where do memorization errors arise in chain-of-thought reasoning?]. Pretraining-time reasoning doesn't pay this per-token generation tax: the work is amortized into the weights once, rather than re-spent (and re-corrupted) at every inference.
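To make the 'forking point' idea concrete, the sketch below ranks positions in a trace by the entropy of the model's next-token distribution and keeps the top fraction. It is an illustration under stated assumptions: the helper names `token_entropy` and `forking_positions` are hypothetical, the distributions are hand-written toy values rather than model outputs, and the 0.2 default simply mirrors the ~20% figure cited above.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_positions(distributions, top_fraction=0.2):
    """Return the indices of the highest-entropy positions in a trace.

    `distributions` holds one next-token distribution per position.
    At least one position is always kept.
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy trace of five positions over a two-token vocabulary (hypothetical values).
trace = [
    [1.00, 0.00],  # fully determined continuation
    [0.90, 0.10],
    [0.50, 0.50],  # maximally uncertain: a candidate forking point
    [0.99, 0.01],
    [0.80, 0.20],
]
forks = forking_positions(trace)  # [2]
```

Low-entropy positions are ones where the continuation is nearly forced, so there is little to learn or decide there; the selection keeps only the positions where the model genuinely has a choice to make.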
The thing worth taking away: the very act of *verbalizing* reasoning may be a training artifact, not a requirement. Depth-recurrent architectures, Coconut, and Heima scale test-time compute entirely in latent space, iterating hidden states without emitting any thinking tokens at all [Can models reason without generating visible thinking tokens?]. So the deepest difference isn't pretraining-vs-inference timing; it's that visible chain-of-thought tokens are just one surface for reasoning that can equally live silently in pretrained weights or in continuous hidden states. The text you see a model 'think' is the least essential part.
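The shape of latent-space reasoning can be shown with a toy loop. This is not how depth-recurrent models, Coconut, or Heima are actually implemented: the `step`, `latent_reasoning`, and `decode` functions, the tanh update, and the weight values are all assumptions chosen only to show extra compute being spent on a hidden state with no intermediate tokens emitted.

```python
import math


def step(hidden, weights):
    """One toy recurrent update, h <- tanh(W h), standing in for a forward pass."""
    return [math.tanh(sum(w * h for w, h in zip(row, hidden))) for row in weights]


def latent_reasoning(hidden, weights, n_steps):
    """Refine the hidden state n_steps times without producing any tokens."""
    emitted = []  # stays empty: nothing is verbalized between steps
    for _ in range(n_steps):
        hidden = step(hidden, weights)
    return hidden, emitted


def decode(hidden, vocab):
    """Toy output head: pick the vocabulary item at the largest coordinate."""
    best = max(range(len(hidden)), key=lambda i: hidden[i])
    return vocab[best]


# Hypothetical weights and starting state, for illustration only.
W = [[0.5, 0.2], [0.1, 0.9]]
h0 = [1.0, 0.0]
final_hidden, thinking_tokens = latent_reasoning(h0, W, n_steps=4)
answer = decode(final_hidden, vocab=["yes", "no"])
```

The only knob that scales compute here is `n_steps`; the single call to `decode` at the end is the only point where anything becomes text.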
Sources (12 notes)
Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning, while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.