How much does training data format shape what reasoning strategy emerges?

This explores whether the *format* of training data — multiple-choice vs. free-form, correct vs. corrupted traces — steers what kind of reasoning a model develops, more than the subject matter or the logic inside it.

The corpus answers with a surprisingly emphatic yes, and the cleanest evidence is direct: models trained on multiple-choice data adopt a breadth-first "scan the options" strategy, while free-form training produces depth-first chains, and the format effect outweighs the domain effect by roughly 7.5 to 1 (see "Does training data format shape reasoning strategy more than domain?"). Presentation, not topic, sets the cognitive habit.
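
The source note for this claim reports the format effect as a Cohen's d of up to 1.5. As a rough illustration of what that statistic measures, here is a minimal sketch of Cohen's d using a pooled standard deviation. The `cohens_d` helper, the "breadth score" metric, and every number below are hypothetical placeholders, not data from the study, and the sketch does not attempt to reproduce the 7.5-to-1 ratio.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference, using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical "breadth score" per evaluated model (illustrative values only).
multiple_choice_trained = [0.62, 0.70, 0.66, 0.74, 0.68]
free_form_trained = [0.55, 0.60, 0.52, 0.63, 0.58]

d = cohens_d(multiple_choice_trained, free_form_trained)
print(f"Cohen's d = {d:.2f}")
```

By Cohen's conventional thresholds, d near 0.2 is read as small, 0.5 as medium, and 0.8 as large, so a value of 1.5 sits well above the "large" mark.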

The reason format dominates becomes clearer once you see how little of "reasoning" the training is actually creating. Several notes converge on the idea that base models already carry latent reasoning ability, and post-training mostly *selects* and *packages* it rather than installing it. Minimal interventions (RL steering, decoding tweaks, feature steering) all surface reasoning that pre-exists in activations (see "Do base models already contain hidden reasoning ability?"), and RL post-training appears to teach *when* to deploy reasoning rather than *how* (see "Does RL post-training create reasoning or just deploy it?"). If the capability is already there, then what training data does is largely formatting work, which is exactly why a 1.5B model with LoRA-only tuning can match larger full-parameter RL models by learning output *organization* instead of new knowledge (see "Can small models reason well by just learning output format?").

The unsettling corollary is that the *content* of reasoning traces matters far less than their *form*. Chain-of-thought exemplars that are logically invalid perform nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"), and traces that have been deliberately corrupted teach about as well as correct ones, sometimes generalizing *better* out of distribution (see "Do reasoning traces need to be semantically correct?"). The model is learning the scaffolding and rhythm of step-by-step output, not the inferential substance. This is why format imprints so hard: it's the part the model can actually imitate.

But format-shaped reasoning is also brittle reasoning. Because the model absorbs the *form* without the underlying logic, it breaks predictably when the presentation shifts: DataAlchemy experiments show chain-of-thought degrading systematically under changes in task, length, and format, producing fluent but logically inconsistent output (see "Does chain-of-thought reasoning actually generalize beyond training data?"). The same property that makes format a powerful lever for *shaping* reasoning makes it a fault line for *generalizing* it. This is worth contrasting with the one ingredient that does seem to travel: broad procedural knowledge drawn from diverse pretraining documents, which transfers across problems in a way that format-mimicry and fact-memorization do not (see "Does procedural knowledge drive reasoning more than factual retrieval?").

If you want to go deeper into the mechanism, two notes zoom into where the format signal actually lives: only ~20% of tokens, the high-entropy "forking points", carry the reasoning learning signal (see "Do high-entropy tokens drive reasoning model improvements?"), and reasoning verbosity turns out to be a single steerable direction in activation space (see "Can we steer reasoning toward brevity without retraining?"). And for the limiting case, what happens when you strip task-specific format away entirely, Quiet-STaR shows reasoning can emerge as a side effect of predicting *any* text, format-free (see "Can models learn reasoning from predicting any text?"). The takeaway the corpus leaves you with: training format isn't a cosmetic choice about how answers look. It's one of the strongest levers you have over how a model thinks, precisely because the model is imitating form more than it's reasoning from content.
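
To make the "forking points" idea concrete, here is a minimal sketch of picking out the highest-entropy positions in a sequence. It assumes you already have a next-token probability distribution for each position. The function names, the 0.2 default fraction, and the toy distributions are illustrative assumptions, not the procedure used in the cited work.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(distributions, fraction=0.2):
    """Return the indices of the highest-entropy positions.

    Keeps roughly `fraction` of all positions (at least one),
    returned in their original sequence order.
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy next-token distributions over a 3-word vocabulary (hypothetical values).
toy = [
    [0.98, 0.01, 0.01],    # near-certain continuation
    [0.34, 0.33, 0.33],    # genuine fork: three plausible continuations
    [0.90, 0.05, 0.05],
    [0.99, 0.005, 0.005],
    [0.80, 0.10, 0.10],
]

print(forking_positions(toy))  # -> [1]
```

In a training setup along the lines the note describes, a mask like this would decide which token positions contribute to the gradient. That wiring is omitted here because the source gives no implementation details.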

Sources 11 notes

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.
