Does training data format shape reasoning strategy more than domain content?

This explores whether *how* training data is presented — its format, like multiple-choice vs. free-form — shapes the reasoning style a model adopts more than *what* the data is about (its subject domain).

The corpus answers this surprisingly directly: yes, and by a wide margin. One study found that training *format* shaped reasoning strategy roughly 7.5 times more strongly than domain content. Models trained on multiple-choice data learned to reason broadly, scanning many options before committing (breadth-first), while free-form training pushed them toward following one line of thought deeply (depth-first). Presentation, not topic, set the cognitive habit (see "Does training data format shape reasoning strategy more than domain?").
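The source note for this finding reports the format effect as a Cohen's d of up to 1.5. As a rough illustration of what that statistic measures, the sketch below computes a pooled-variance Cohen's d for two groupings of the same kind of score. Everything here is an assumption made for illustration: the per-trace "breadth score", the group names, and the numbers are hypothetical, and the final ratio is not the study's actual method for arriving at its 7.5x figure.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled SD)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical breadth scores (e.g. options considered per reasoning trace).
# Grouped by training FORMAT:
multiple_choice = [5, 6, 7, 6, 5, 7]
free_form = [3, 4, 3, 4, 5, 3]

# Hypothetical scores grouped by training DOMAIN instead:
math_domain = [5, 4, 6, 5, 4, 6]
medical_domain = [4, 5, 5, 4, 6, 5]

d_format = cohens_d(multiple_choice, free_form)
d_domain = cohens_d(math_domain, medical_domain)

print(f"format effect d = {d_format:.2f}")
print(f"domain effect d = {d_domain:.2f}")
print(f"format/domain ratio = {abs(d_format) / abs(d_domain):.1f}")
```

A large d for the format grouping next to a small d for the domain grouping is the pattern the note describes; the real study may have used different measures and a different way of comparing them.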

Why would format have such leverage? A clue comes from work showing that reasoning ability isn't really *created* during training: it's already latent in the base model, and training mostly selects and routes it. Five independent mechanisms all elicit reasoning that pre-exists in the base model's activations rather than installing it (see "Do base models already contain hidden reasoning ability?"), and RL post-training has been characterized as teaching a model *when* to deploy reasoning, not *how* to reason (see "Does RL post-training create reasoning or just deploy it?"). If the raw capability is already there, then the training signal's main job is to shape *strategy and deployment*, which is exactly the lever that format pulls.

There's a deeper version of this idea: what transfers in reasoning is *procedural* knowledge (the how-to patterns drawn from many documents) rather than fact-specific recall, which depends on narrow memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). Format is essentially a procedural template. A multiple-choice layout teaches one procedure (enumerate, compare, eliminate); a free-form prompt teaches another (commit, elaborate, follow through). The model absorbs the procedure regardless of whether the questions were about math or medicine. This also fits the finding that knowledge and reasoning live in different parts of the network, with facts in lower layers and reasoning adjustments in higher ones, so domain content and reasoning strategy can be tuned somewhat independently (see "Why does reasoning training help math but hurt medical tasks?").
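The "format as procedural template" point can be made concrete with a small sketch that renders the same question in both layouts. The function names, the instruction wording, and the example question are hypothetical, chosen only to show that the domain content stays fixed while the implied procedure changes.

```python
import string


def as_multiple_choice(question, options):
    """Render a question as lettered options (enumerate, compare, eliminate)."""
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def as_free_form(question):
    """Render the same question open-ended (commit, elaborate, follow through)."""
    return f"{question}\nExplain your reasoning, then state the final answer."


question = "What is the derivative of x^2?"
print(as_multiple_choice(question, ["x", "2x", "x^2 / 2"]))
print()
print(as_free_form(question))
```

Swapping the question for a medical one changes the domain but leaves both templates, and the procedures they suggest, untouched.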

But the format-driven strategy is fragile in a revealing way. Chain-of-thought reasoning degrades predictably once you shift the task, length, or *format* away from what the model trained on: models keep producing fluent reasoning that's structurally familiar but logically hollow, imitating the *form* without the underlying logic (see "Does chain-of-thought reasoning actually generalize beyond training data?"). That's the flip side of the headline finding: if format is what's really being learned, then a format the model hasn't seen is exactly what breaks it. The same brittleness shows up with input length, where accuracy collapses well below the context limit in a task-agnostic way (see "Does reasoning ability actually degrade with longer inputs?").

The practical upshot is that reasoning strategies behave like steerable, almost modular settings rather than deep properties of subject expertise. Verbose vs. concise reasoning turns out to be a single linear direction you can adjust without retraining at all (see "Can we steer reasoning toward brevity without retraining?"), and domain-adaptation methods consistently trade visible gains for hidden costs in format flexibility and reasoning faithfulness (see "How do domain training techniques actually reshape model behavior?"). If you want to change *how* a model thinks, you may have more leverage reshaping the format of what it sees than the field it studies.
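To illustrate what "a single linear direction" can mean in practice, here is a minimal difference-of-means steering sketch in plain Python. It is a generic illustration, not the cited method's implementation: the toy two-dimensional activations, the function names, and the `alpha` scaling parameter are all assumptions, and a real setup would read and write hidden states inside a model.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(verbose_acts, concise_acts):
    """Direction pointing from mean verbose activation to mean concise one."""
    verbose_mean = mean_vector(verbose_acts)
    concise_mean = mean_vector(concise_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations from paired examples (hypothetical values).
verbose_acts = [[1.0, 0.0], [3.0, 0.0]]
concise_acts = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_vector(verbose_acts, concise_acts)
print(steer([1.0, 1.0], direction, alpha=0.5))
```

No weights change here: the adjustment is a vector added at inference time, which is what makes this kind of steering training-free.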

Sources (9 notes)

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Why does reasoning training help math but hurt medical tasks?

A two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
