Does task diversity in pretraining data transfer reasoning better than larger models?
This explores whether reasoning ability comes more from *what's in the pretraining data* — especially varied procedural examples — than from simply scaling models bigger, and the corpus actually reframes the question: the real lever may be elicitation and data composition, not parameter count.
This explores whether reasoning transfers better through diverse, procedure-rich pretraining data than through sheer model size, and the corpus suggests the question's instinct is right, but for a deeper reason than "diversity beats scale." The most direct evidence comes from an analysis of five million pretraining documents showing that reasoning leans on *broad, transferable procedural knowledge* drawn from many varied sources, while factual recall depends on narrow, document-specific memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). In other words, what makes a model reason isn't memorizing answers; it's having absorbed many worked examples of *how to do things*. Diversity of procedure, not volume of facts, is the active ingredient.
But here the corpus complicates the framing in a useful way: several lines of work argue that reasoning isn't really "transferred" or "created" at scale at all; it's *already latent* and merely unlocked. Five independent methods (RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, RLVR) all elicit reasoning that base models already contain, suggesting post-training selects rather than builds capability (see "Do base models already contain hidden reasoning ability?"). A companion finding sharpens this: RL post-training teaches a model *when* to reason, not *how*. Hybrid models recover 91% of the gains by routing tokens alone (see "Does RL post-training create reasoning or just deploy it?"). If reasoning is latent in the base model, then the pretraining data that seeded it, and how procedurally diverse it was, matters more than anything bolted on later.
That said, scale doesn't vanish. Reasoning-trained models persistently beat non-reasoning ones no matter how much inference compute the latter are given, because training instills a *protocol* that makes extra tokens productive (see "Can non-reasoning models catch up with more compute?"). So it's not size per se but *training regime* that draws the line. And small models can punch far above their weight: small models fine-tuned with DPO on a large teacher's correct and incorrect function-calling examples reach high accuracy on logical and mathematical tasks (see "Can small models match large models on function calling?"). That's an existence proof that the right data composition can narrow a size gap.
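To make the "correct-and-incorrect examples" idea concrete, here is a minimal sketch of the standard DPO preference loss for a single (chosen, rejected) pair. It is not the cited work's implementation: the function name `dpo_loss`, the default `beta=0.1`, and the log-probability values in the usage lines are all illustrative assumptions. The point is only to show how a rejected example enters the objective explicitly, which plain SFT on correct outputs does not do.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled preference logit: positive when the policy favors the chosen
    # (correct) response over the rejected (incorrect) one.
    logit = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logit)), written in a numerically stable form.
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))


# Illustrative values only (assumed, not taken from any experiment).
loss_when_policy_prefers_correct = dpo_loss(-1.0, -5.0, -2.0, -2.0)
loss_when_policy_prefers_incorrect = dpo_loss(-5.0, -1.0, -2.0, -2.0)
print(loss_when_policy_prefers_correct, loss_when_policy_prefers_incorrect)
```

The loss falls when the policy raises the teacher's correct call relative to the incorrect one, so the rejected example shapes the gradient directly. In this framing, that is how a small model can be pushed away from specific output-format failures.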
There's a sharp limit worth knowing, though, that cuts against naive optimism about diversity. Reasoning failures turn out to be driven by *instance-level unfamiliarity*, not task complexity: models fit patterns from instances they've seen rather than learning general algorithms, so a chain succeeds only when something similar was in training (see "Do language models fail at reasoning due to complexity or novelty?"). Chain-of-thought degrades predictably the moment you shift task, length, or format away from the training distribution (see "Does chain-of-thought reasoning actually generalize beyond training data?"). This is why task diversity matters mechanically: broader coverage of procedures and instances is what *widens the distribution* where reasoning holds, but it never makes the model distribution-free.
The takeaway you didn't know you wanted: the diversity-vs-scale framing is partly a false binary. Diverse procedural data and varied *task scheduling* during training (e.g., training structured tasks before creative ones to avoid entropy collapse; see "Does training order reshape how models handle different task types?") shape *which* reasoning a model can deploy and *how far* it generalizes. Scale mostly determines headroom. So for transferring reasoning, betting on richer, more varied procedural data is generally a better marginal investment than betting on parameters alone, as long as you remember the model is generalizing from familiar instances, not reasoning from first principles.
Sources (8 notes)
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds when the model was trained on similar instances.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.