How much does training data format influence reasoning strategy versus domain content?
This explores whether the *shape* of training data (multiple-choice vs. free-form, format and presentation) does more to determine how a model reasons than the actual subject matter it's trained on.
The corpus has a surprisingly crisp answer: format dominates. Models trained on multiple-choice data adopt breadth-first exploration, while free-form training produces depth-first reasoning, and the format effect outweighs the domain effect by roughly 7.5 to 1 (see "Does training data format shape reasoning strategy more than domain?"). In other words, *how* a problem is presented teaches the model a habit of thinking that carries over to whatever topic you throw at it.
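The source note behind this claim reports the format effect as a standardized effect size (Cohen's d). As a minimal sketch of how such a number is computed, assuming the pooled-standard-deviation form of Cohen's d and a hypothetical per-trace "breadth score", the calculation looks like this. The function name and the sample values are made up for illustration and are not data from the study.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical breadth scores (e.g. distinct approaches tried per reasoning trace).
# These values are invented purely to exercise the function.
multiple_choice_trained = [2, 4, 6]
free_form_trained = [1, 3, 5]

print(cohens_d(multiple_choice_trained, free_form_trained))  # 0.5
```

A ratio like "7.5 to 1" would then come from comparing an effect measured across formats with the same kind of effect measured across domains; how the original analysis aggregated those comparisons is not stated here.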
The reason this isn't just a quirk becomes clearer when you look at where reasoning actually comes from. Reasoning ability seems to be drawn from broad, transferable procedural patterns picked up across many pretraining documents, unlike factual recall, which depends on narrowly memorizing specific source documents (see "Does procedural knowledge drive reasoning more than factual retrieval?"). If reasoning is a procedure rather than a fact, then it makes sense that the *form* of the data, meaning the structural template it presents, would imprint the procedure more strongly than the content does. Several lines of work go further and argue the reasoning is already latent in the base model: post-training selects and elicits it rather than creating it (see "Do base models already contain hidden reasoning ability?"), with RL teaching a model *when* to deploy reasoning rather than *how* to do it (see "Does RL post-training create reasoning or just deploy it?"). Under that view, training format is essentially a steering signal that picks which pre-existing strategy gets activated.
There's a sharp downside to format being this powerful, though. If models are imitating the *form* of reasoning rather than its underlying logic, they should break when the form shifts, and they do. Chain-of-thought degrades predictably under distributional shifts in task, length, and format, producing fluent but logically inconsistent reasoning (see "Does chain-of-thought reasoning actually generalize beyond training data?"). Reasoning accuracy also collapses just from longer inputs, well below the context limit, in a way that's task-agnostic (see "Does reasoning ability actually degrade with longer inputs?"). These are the symptoms you'd expect from a system that learned a presentational pattern rather than a content-general competence.
The lateral surprise here is how *local* the format signal turns out to be. Reasoning improvements during RLVR are concentrated in a small minority of tokens: only about 20% are high-entropy "forking points," and training on those alone matches full updates (see "Do high-entropy tokens drive reasoning model improvements?"). Reasoning verbosity is even a single linear direction in activation space that you can steer without retraining (see "Can we steer reasoning toward brevity without retraining?"). So "format shapes strategy" cashes out concretely: format is nudging a handful of decision tokens and a few activation directions, not rewriting the model's knowledge. The flip side is that domain-specific adaptation methods each have narrow sweet spots and tend to carry hidden costs, with gains in one place quietly degrading reasoning faithfulness or format flexibility elsewhere (see "How do domain training techniques actually reshape model behavior?"). That is exactly why content-tuning struggles to compete with format for control over reasoning style.
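To make the "forking points" idea concrete, here is a minimal sketch of how high-entropy positions could be picked out of a sequence. It assumes you already have the model's next-token probability distribution at each position, uses Shannon entropy as the uncertainty measure, and keeps the top 20% of positions. The function names and the toy distributions are hypothetical, and the exact selection rule used in the cited work may differ.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_token_mask(distributions, top_fraction=0.2):
    """Mark the highest-entropy positions in a sequence.

    Returns a list of booleans, True where the position falls in the
    top `top_fraction` of positions by entropy (at least one position).
    """
    entropies = [token_entropy(p) for p in distributions]
    k = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    selected = set(ranked[:k])
    return [i in selected for i in range(len(entropies))]


# Toy per-position distributions over a 4-token vocabulary (illustrative only).
toy_distributions = [
    [1.0, 0.0, 0.0, 0.0],      # fully confident
    [0.9, 0.1, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a "forking point"
    [0.97, 0.01, 0.01, 0.01],
    [0.8, 0.2, 0.0, 0.0],
]

mask = forking_token_mask(toy_distributions)
print(mask)  # [False, False, True, False, False]
```

In a training setup, a mask like this could be used to restrict the loss or gradient to the selected positions, which is the kind of restriction the 20% result refers to.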
The takeaway you didn't know you wanted: if you care about *how* a model reasons, you may get more leverage from changing the presentation of your training examples, or from steering an activation vector, than from feeding it more domain content. Approaches like learning rationales at the token level on arbitrary text (see "Can models learn reasoning from predicting any text?") lean into exactly this, treating reasoning as a format-level skill that emerges independent of any particular subject.
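For readers unfamiliar with activation steering, the sketch below shows one common way a steering direction can be built and applied. It assumes a difference-of-means construction over paired concise and verbose examples and an additive update with a strength parameter `alpha`. This is an assumption about the general technique, not a description of the specific method in the cited note, and all names and vectors are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(concise_activations, verbose_activations):
    """Direction pointing from the verbose mean toward the concise mean."""
    concise_mean = mean_vector(concise_activations)
    verbose_mean = mean_vector(verbose_activations)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden_state, direction, alpha=1.0):
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]


# Toy 2-dimensional activations (illustrative only; real ones have thousands of dims).
concise = [[1.0, 2.0], [3.0, 4.0]]
verbose = [[0.0, 0.0], [2.0, 2.0]]

direction = steering_vector(concise, verbose)
print(direction)                            # [1.0, 2.0]
print(steer([0.0, 0.0], direction, 0.5))    # [0.5, 1.0]
```

The point of the sketch is that the intervention happens at inference time on activations, with no weight updates, which is what makes it an alternative to adding more domain training data.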
Sources (10 notes)
Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.