How much does pre-training frequency predict reasoning task performance?
This explores whether how often something appears in pretraining data predicts how well a model reasons about it. The collection suggests reasoning breaks the frequency rule that governs factual recall.
Does the raw frequency of content in pretraining data predict reasoning performance the way it predicts fact retrieval? The most direct answer in the collection is that reasoning and fact recall draw on different statistical foundations. An analysis of five million pretraining documents found that factual recall depends on narrow, document-specific memorization (the model needs to have seen the target fact, repeatedly), while reasoning leans on broad, transferable procedural knowledge spread across many diverse sources (Does procedural knowledge drive reasoning more than factual retrieval?). So for reasoning, the question is less 'how many times did the model see this exact problem' and more 'how much general procedure for this kind of step did it absorb.' Frequency predicts memorization well; it predicts reasoning poorly.
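One way to make "predicts well" versus "predicts poorly" concrete is a rank correlation between per-item frequency counts and per-item accuracy, computed once for recall items and once for reasoning items. The sketch below is an illustration only: the choice of Spearman correlation and the function names are assumptions, not the method of the study cited above, and no data from that study is reproduced here.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

With one's own measurements, `spearman(frequencies, recall_accuracy)` and `spearman(frequencies, reasoning_accuracy)` could then be compared directly; the claim in the text would correspond to the first value being clearly larger than the second.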
That distinction gets sharper when you look at where the abilities live inside the network. One line of work, a two-phase model of inference, locates knowledge retrieval in the lower layers and reasoning adjustment in the higher ones (Why does reasoning training help math but hurt medical tasks?). This would explain why reasoning-focused training can lift math scores while quietly degrading knowledge-heavy domains like medicine: on this account the two capabilities are mechanically separate, so the frequency-driven recall layers and the procedure-driven reasoning layers don't move together.
There's a second, surprising wrinkle: a lot of reasoning ability is already latent in the base model before any reasoning-specific training, waiting to be elicited rather than built. Five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all unlock reasoning that pretraining already deposited (Do base models already contain hidden reasoning ability?). That reframes the frequency question: the bottleneck isn't seeing enough reasoning examples, it's accessing what diverse procedural exposure already planted. And that exposure can be pushed earlier: treating chain-of-thought as an exploratory action rewarded during pretraining itself lifts math and science benchmarks by ~19% (Can chain-of-thought reasoning be learned during pretraining itself?).
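The pretraining-time reward mentioned above is described in the source notes only as a log-likelihood improvement that needs no verifier. A minimal sketch of that idea, under stated assumptions: the reward is taken to be the log-probability of the observed next token with the thought in context minus the same log-probability without it, and per-position rewards are simply averaged. The baseline choice, the averaging, and both function names are hypothetical, not details confirmed by the source.

```python
import math
from typing import Sequence


def thought_reward(logp_with_thought: float, logp_without_thought: float) -> float:
    """Reward for one position: how much the thought raised the
    log-probability of the token that actually came next."""
    return logp_with_thought - logp_without_thought


def mean_thought_reward(
    with_thought: Sequence[float], without_thought: Sequence[float]
) -> float:
    """Average the per-position rewards (the averaging is an assumption)."""
    if len(with_thought) != len(without_thought) or not with_thought:
        raise ValueError("need two non-empty, equal-length sequences")
    gains = [thought_reward(a, b) for a, b in zip(with_thought, without_thought)]
    return sum(gains) / len(gains)


# If the thought doubles the probability of the observed token
# (0.25 -> 0.5), the reward is log(0.5) - log(0.25) = log(2).
example_gain = thought_reward(math.log(0.5), math.log(0.25))
```

The property worth noticing is that the reward is positive only when the thought makes the real continuation more likely, so no external answer checker is involved.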
The limits of frequency show up most clearly under distribution shift. Chain-of-thought reasoning degrades predictably once a task drifts from its training distribution in task, length, or format: models keep producing fluent reasoning that is logically hollow (Does chain-of-thought reasoning actually generalize beyond training data?). If reasoning were a genuinely frequency-robust procedural skill, it would transfer; instead it often turns out to be pattern-matching on familiar forms, which is exactly what a frequency story would predict. The same fragility appears with mere input length: accuracy can fall from 92% to 68% with about 3,000 tokens of padding, far below the context limit, and the drop is uncorrelated with language-modeling quality (Does reasoning ability actually degrade with longer inputs?).
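The length effect can be probed with a very small harness: keep the facts a question depends on fixed, add irrelevant filler until the input reaches a target length, and compare accuracy before and after. The sketch below is a simplification under stated assumptions: tokens are counted by whitespace rather than by a model tokenizer, filler is only appended after the key facts, and the helper names are hypothetical rather than taken from FLenQA.

```python
from itertools import cycle


def count_tokens(text):
    """Crude token count: whitespace-separated words (an assumption)."""
    return len(text.split())


def pad_input(key_facts, filler_sentences, target_tokens):
    """Append filler after the key facts until the input has at least
    target_tokens whitespace tokens; inputs already long enough are unchanged."""
    if not any(count_tokens(s) > 0 for s in filler_sentences):
        raise ValueError("filler must contain at least one non-empty sentence")
    parts = list(key_facts)
    total = sum(count_tokens(p) for p in parts)
    for sentence in cycle(filler_sentences):
        if total >= target_tokens:
            break
        parts.append(sentence)
        total += count_tokens(sentence)
    return " ".join(parts)


def accuracy_drop(baseline, padded):
    """Return (absolute drop, relative drop) between two accuracies in [0, 1]."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    absolute = baseline - padded
    return absolute, absolute / baseline


# The figures quoted in the text: 92% falling to 68%.
absolute, relative = accuracy_drop(0.92, 0.68)
# absolute is 24 percentage points; relative is 24/92, roughly 26%.
```

Reporting both numbers is a deliberate choice: the 24-point absolute drop is what the text quotes, while the relative drop shows that roughly a quarter of the originally correct answers are lost.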
The honest synthesis: frequency strongly predicts what a model can recall and only weakly predicts what it can reason through. Diverse procedural exposure matters more than repetition, the capability is often already latent rather than frequency-limited, and where performance does look frequency-bound, that is usually a sign the 'reasoning' is shallow imitation of seen forms rather than the real thing. The collection doesn't offer a clean quantitative coefficient. It offers something more useful: frequency is the wrong axis for reasoning, and asking the question exposes the gap between recall and genuine inference.
Sources (6 notes)
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.