Why does training format shape reasoning strategy more than domain?

This explores why *how* a model is trained — the shape of its training examples, like multiple-choice vs. free-form — ends up steering its reasoning style more strongly than *what* subject it was trained on.

The headline result is striking: format shapes reasoning strategy about 7.5 times more than domain does. Multiple-choice training pushes models toward breadth-first exploration, while free-form training produces depth-first reasoning (see "Does training data format shape reasoning strategy more than domain?"). In other words, presentation teaches the model a *habit of thinking*, and that habit travels across whatever topic you point it at.
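The source note for this result reports the format effect as a Cohen's d of up to 1.5. As a reminder of what that statistic measures, here is a minimal sketch of Cohen's d with a pooled standard deviation. The "breadth scores" and the variable names are invented for illustration; they are not data from the cited work.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference using a pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (divides by n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical per-model "breadth of exploration" scores (made up).
mc_trained = [6.0, 7.0, 8.0, 7.0, 7.0]
free_form_trained = [5.0, 6.0, 7.0, 6.0, 6.0]

d = cohens_d(mc_trained, free_form_trained)
print(round(d, 2))  # about 1.41 for these made-up numbers
```

A d near 1.4 means the two group means sit roughly 1.4 pooled standard deviations apart, which is conventionally read as a large effect.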

The reason this happens becomes clearer once you stop thinking of post-training as *creating* reasoning and start thinking of it as *selecting* it. Several independent lines of evidence suggest base models already carry latent reasoning capability, and that minimal training merely unlocks or routes it rather than building it from scratch (see "Do base models already contain hidden reasoning ability?"). One framing puts it sharply: RL post-training teaches a model *when* to reason, not *how*. Hybrid models recover most of the gains (91% in the cited note) just by changing which tokens get the reasoning treatment (see "Does RL post-training create reasoning or just deploy it?"). If the raw reasoning machinery is already present, then the training signal's main job is to pick a deployment pattern, and the *format* of your data is exactly the thing that encodes that pattern. Multiple-choice formats reward scanning options; free-form formats reward following a single thread deep. The domain is almost incidental.
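For readers who want the breadth-first versus depth-first distinction made concrete, the sketch below walks the same toy tree in both orders. It is only an analogy for "scanning options" versus "following a single thread deep". The tree, its labels, and the function names are hypothetical, and nothing here claims to model how a language model actually reasons.

```python
from collections import deque

# Hypothetical toy "reasoning tree": each node maps to the steps it opens up.
TREE = {
    "question": ["option A", "option B"],
    "option A": ["A: step 1"],
    "option B": ["B: step 1"],
    "A: step 1": ["A: step 2"],
    "B: step 1": [],
    "A: step 2": [],
}


def breadth_first(tree, root):
    """Visit every node at one level before going deeper (scan the options)."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order


def depth_first(tree, root):
    """Follow one branch to its end before trying the next (one thread deep)."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so the leftmost child is explored first.
        stack.extend(reversed(tree[node]))
    return order


print(breadth_first(TREE, "question"))
print(depth_first(TREE, "question"))
```

The breadth-first walk touches both options before taking any second step, while the depth-first walk finishes the whole "option A" thread before it looks at "option B".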

There's a deeper architectural hint for why format and content separate so cleanly. Knowledge tends to live in the lower layers of the network while reasoning adjustments happen in higher layers (see "Why does reasoning training help math but hurt medical tasks?"). That separation is why reasoning training can sharpen math while quietly degrading knowledge-heavy domains like medicine: the reasoning *strategy* being shaped is somewhat decoupled from the domain *facts* being stored. Reasoning style is even directionally steerable: verbose versus concise chains of thought occupy distinct linear regions of activation space, and you can slide between them with a single extracted vector and no retraining at all (see "Can we steer reasoning toward brevity without retraining?"). If a whole reasoning style is one steerable direction, it makes sense that the *format* of training, which consistently pushes activations one way, dominates the relatively diffuse signal of domain content.
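The idea of sliding along "a single extracted vector" can be sketched with a difference-of-means direction, one common and simple way to build a steering vector from paired examples. Whether the cited method computes its vector exactly this way is an assumption, not something the source states. The activations below are made-up 3-dimensional lists, and `steering_vector`, `steer`, and `alpha` are hypothetical names; a real setup would operate on a transformer's hidden states.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(verbose_acts, concise_acts):
    """Difference of means, pointing from the verbose region to the concise one."""
    mu_verbose = mean_vector(verbose_acts)
    mu_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mu_concise, mu_verbose)]


def steer(hidden, direction, alpha):
    """Shift one hidden state along the direction; alpha sets the strength."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Made-up activations for paired verbose / concise reasoning traces.
verbose_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
concise_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = steering_vector(verbose_acts, concise_acts)
steered = steer([2.0, 0.0, 2.0], direction, alpha=0.5)
print(direction)  # [-2.0, 2.0, 0.0]
print(steered)    # [1.0, 1.0, 2.0]
```

No weights change in this sketch: the only intervention is adding a scaled vector at inference time, which is what makes this kind of steering training-free.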

The twist worth sitting with: a model can learn the *form* of reasoning without the underlying logic. Chain-of-thought degrades predictably once you shift the task, length, or format away from the training distribution, producing fluent but logically inconsistent reasoning, an imitation of reasoning's shape rather than the thing itself (see "Does chain-of-thought reasoning actually generalize beyond training data?"). This is the dark side of format dominance: if format is what gets learned most strongly, then a model trained on one format may be performing a *style* of reasoning that collapses the moment the surface form changes. Related work shows reasoning models often "wander" unsystematically rather than search validly, which is what you'd expect if they absorbed a formatting habit rather than a sound procedure (see "Why do reasoning LLMs fail at deeper problem solving?").
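The source note on wandering says success probability drops exponentially with problem depth. One simple way to see why is to assume, purely for illustration, that each reasoning step succeeds independently with the same probability. That independence assumption and the 0.9 figure are mine, not the cited work's.

```python
def success_probability(per_step_success, depth):
    """Chance that every one of `depth` independent steps succeeds."""
    return per_step_success ** depth


# Assumed per-step success rate of 0.9 (illustrative only).
for depth in (1, 5, 10, 20):
    print(depth, round(success_probability(0.9, depth), 3))
# 1 0.9
# 5 0.59
# 10 0.349
# 20 0.122
```

Under this toy model a chain that is right 90% of the time per step still fails most 10-step problems, which matches the pattern of medium problems being solvable while deep ones become far harder.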

What ties this together is the pretraining-side finding that reasoning generalization rides on broad, transferable *procedural* knowledge (the how-to patterns scattered across many documents) rather than narrow factual recall tied to specific texts (see "Does procedural knowledge drive reasoning more than factual retrieval?"). Procedure is format-like; facts are domain-like. So the whole stack lines up: reasoning is procedural and transferable, the format of your examples teaches a procedure, and that procedure outweighs the particular facts of any one field. The takeaway is that 'teaching a model to reason' is often really 'teaching it a presentation habit', and choosing your data format may matter more than choosing your subject.

Sources 8 notes

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. The format effect dwarfs the domain effect, meaning presentation matters far more than content type.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Why does reasoning training help math but hurt medical tasks?

A two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
