Why does training data format shape reasoning strategy more than content?

This explores why *how* training data is presented (its shape, such as multiple-choice vs. free-form) pushes a model toward a particular reasoning style more strongly than *what* the data is about.

The corpus points to one underlying reason why format steers reasoning strategy more than subject matter: models learn the *shape* of reasoning far more readily than its substance. The headline result is stark. Train on multiple-choice data and models adopt breadth-first exploration; train on free-form data and they go depth-first instead. This format effect outweighs the domain effect by about 7.5x (see "Does training data format shape reasoning strategy more than domain?"). Presentation, not content type, sets the reasoning style.
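The source note for this result reports its effect as Cohen's d (up to 1.5). As a reminder of what that statistic measures, here is a minimal sketch of the standard pooled-standard-deviation formula. The function name and the two score lists are hypothetical, made up purely for illustration; they are not data from the study, and the note does not say how the 7.5x ratio itself was computed.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical "breadth" scores per problem (illustrative numbers only).
mc_trained = [5.0, 6.0, 7.0, 6.0, 6.0]
free_form_trained = [4.0, 5.0, 6.0, 5.0, 5.0]

d = cohens_d(mc_trained, free_form_trained)
print(f"Cohen's d = {d:.2f}")  # about 1.41 for these made-up scores
```

A d of 1.5 means the two group means sit one and a half pooled standard deviations apart, which is conventionally read as a very large effect.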

Why would form dominate so completely? Because a striking body of evidence suggests that what looks like reasoning is largely the *imitation of reasoning's form*. Chain-of-thought exemplars that are logically invalid perform nearly as well as valid ones: it's the structural pattern, not the logic, that drives the gains (see "Does logical validity actually drive chain-of-thought gains?"). Push further and you find that deliberately corrupted, irrelevant reasoning traces teach about as well as correct ones, behaving like computational scaffolding rather than meaningful steps (see "Do reasoning traces need to be semantically correct?"). If correctness of content barely matters, it follows that the format, which is what's actually being absorbed, is what shapes the strategy.

There's a deeper mechanism beneath this. Several lines of work argue that base models already contain latent reasoning capability, and post-training mostly *selects* or *organizes* it rather than creating it (see "Do base models already contain hidden reasoning ability?"). RL post-training, on this view, teaches a model *when* to reason, not *how*: the strategies pre-exist as directions in activation space (see "Does RL post-training create reasoning or just deploy it?"). A 1.5B model with LoRA-only tuning can match much larger RL models by learning output *format* alone, suggesting reasoning organization and factual knowledge are separable (see "Can small models reason well by just learning output format?"). If training is fundamentally an act of *eliciting and routing* capability that's already there, then the format of the data is the lever that decides which pre-existing pattern gets switched on; content just rides along.
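To see why a LoRA-only result fits the "organizing, not creating" reading, it helps to count how few weights LoRA actually trains. LoRA freezes a weight matrix W of shape d_out x d_in and learns a low-rank update B·A, with B of shape d_out x r and A of shape r x d_in. The sketch below does that arithmetic for a single layer. The layer size and rank are assumptions chosen for illustration; the source note does not report the adapter configuration used.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable weights in one LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Weights in the frozen dense matrix W (d_out x d_in)."""
    return d_in * d_out


# Hypothetical layer size and rank, not taken from the source.
d_in = d_out = 2048
rank = 16

lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
fraction = lora / full

print(f"LoRA trains {lora:,} of {full:,} weights ({fraction:.2%})")
```

With these assumed numbers the adapter trains under 2% of the layer's weights, so whatever it learns has to be a small adjustment on top of what the frozen base model already encodes.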

The flip side is worth noting, because it sharpens the boundary. Content *does* matter for one thing: the procedural knowledge a model can draw on. Analysis of millions of pretraining documents shows reasoning generalizes from broad, transferable procedural patterns, while factual recall depends on narrow document-specific memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). So content builds the *repertoire*; format selects the *strategy*. And the format-driven story has a cost: it's brittle. Chain-of-thought degrades predictably the moment you shift task, length, or format away from the training distribution, producing fluent but logically inconsistent output (see "Does chain-of-thought reasoning actually generalize beyond training data?"). The very fact that format transfers so powerfully is also why models break when the format changes.

The unsettling takeaway: if format shapes strategy more than content, then benchmark accuracy can rise while genuine reasoning quality falls. Supervised fine-tuning lifts final-answer scores while cutting the actual inferential information gain by about 39%: the model learns to produce correct-looking answers through post-hoc rationalization (see "Does supervised fine-tuning improve reasoning or just answers?"). You're often training the costume of reasoning, not the reasoning.

Sources (9 notes)

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
