
Does training data format shape reasoning strategy more than domain?

What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.

Note · 2026-02-22 · sourced from Reasoning Methods CoT ToT
Related: How should we allocate compute budget at inference time? · How do you build domain expertise into general AI models?

The CoT Encyclopedia paper isolates two variables that could explain differences in reasoning strategy across models: training data domain (math vs. commonsense vs. coding) and training data format (multiple-choice vs. free-form). The finding is striking: the format effect reaches a Cohen's d of up to 1.5, while the domain effect is consistently below 0.2, a 7.5x difference in effect size.
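For scale: Cohen's d is the standardized mean difference between two groups, so a d of 1.5 means the strategy-score distributions of MC- and FF-trained models barely overlap. A minimal sketch of the computation (variable names are illustrative, not the paper's):

```python
import numpy as np

def cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> float:
    """Standardized mean difference: d = (mean_a - mean_b) / pooled_std.

    Rule of thumb: |d| >= 0.8 is a large effect, so the format effect
    (up to 1.5) is very large while the domain effect (< 0.2) is negligible.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
    ) / (n_a + n_b - 2)
    return (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var)

# e.g. cohens_d(mc_strategy_scores, ff_strategy_scores)
```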

The pattern breaks down cleanly: multiple-choice (MC) training produces breadth-first reasoning that surveys several candidate paths before committing, while free-form (FF) training produces depth-first reasoning that commits early to a single path and elaborates it.

The practical implication is significant: if you want to control a model's reasoning strategy (whether it explores broadly before committing or digs deep on one path), change the format of its training data, not its domain. This is more tractable than domain curation because format is a presentation decision, not a content decision.
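To make "presentation decision" concrete, here is a hypothetical sketch that renders the same QA item in either format; the content (question and answer) is untouched, only the surface form changes:

```python
import random

def to_multiple_choice(question: str, answer: str, distractors: list[str]) -> str:
    """Render a QA item in multiple-choice format (a presentation change only)."""
    options = [answer] + distractors
    random.shuffle(options)
    lines = [question] + [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

def to_free_form(question: str) -> str:
    """Render the same item free-form; the model must generate the answer itself."""
    return f"{question}\nAnswer:"
```

Per the finding above, which of these two renderings a corpus uses shifts the resulting reasoning strategy far more than whether the items are math, commonsense, or code.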

The CoT Encyclopedia goes further: it demonstrates that this formatting signature persists and is controllable. By linearly interpolating model weights between MC-trained and FF-trained versions, you can produce models whose strategy shifts smoothly between the two profiles, without any fine-tuning. Strategy becomes a tunable parameter, not an emergent property.
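A minimal sketch of that interpolation, assuming two checkpoints of the same architecture (helper name and usage are illustrative, not the paper's code):

```python
import torch

def interpolate_weights(mc_state: dict, ff_state: dict, alpha: float) -> dict:
    """Linearly interpolate two state dicts with identical keys and shapes.

    alpha = 0.0 keeps the MC-trained weights (breadth-first profile),
    alpha = 1.0 keeps the FF-trained weights (depth-first profile);
    intermediate values trade off strategy smoothly, per the paper.
    """
    assert mc_state.keys() == ff_state.keys()
    return {
        name: torch.lerp(mc_state[name].float(), ff_state[name].float(), alpha)
        for name in mc_state
    }

# merged = interpolate_weights(mc_model.state_dict(), ff_model.state_dict(), 0.5)
# model.load_state_dict(merged)
```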

This connects to 'Why do reasoning models fail differently at training versus inference?': the entropy collapse problem may be partly a format artifact. MC-training produces BFS-like exploration (more diverse across paths); FF-training produces the collapse-prone depth-first profile that RL training then narrows further.

The finding also challenges the assumption that domain-specific training creates domain-specific reasoning styles. What changes domain-to-domain is not the reasoning strategy but the knowledge being applied. The strategy is set by format earlier in training.

RLVR spurious rewards confirm pretraining format as the controlling variable. The spurious-rewards finding provides independent evidence: Qwen2.5-Math improves nearly as much with random, incorrect, or format-only rewards as with ground-truth rewards (~21-25% improvement), but Llama3.1 and OLMo2 fail completely with the same spurious rewards. The critical difference: Qwen's pretraining included extensive code-reasoning data, creating a latent "code reasoning" strategy that surfaces under any optimization pressure. The reward signal's content is irrelevant; what matters is that Qwen's pretraining format created a reasoning strategy that RLVR can activate regardless of reward quality. This is the format-dominance principle at the pretraining level: Qwen's code-format pretraining determines its RLVR responsiveness more than any post-training variable. See 'Why do random rewards improve reasoning for some models but not others?'.
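A hedged sketch of the three reward variants being contrasted; the paper's exact reward definitions may differ, and extracting answers from \boxed{} is an assumption here:

```python
import random
import re

def ground_truth_reward(completion: str, gold: str) -> float:
    """1.0 only if the extracted final answer matches the reference."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return float(match is not None and match.group(1).strip() == gold.strip())

def format_only_reward(completion: str, gold: str) -> float:
    """Rewards merely producing a boxed answer, correct or not."""
    return float(re.search(r"\\boxed\{", completion) is not None)

def random_reward(completion: str, gold: str) -> float:
    """Ignores the completion entirely."""
    return float(random.random() < 0.5)
```

On Qwen2.5-Math, all three recover most of the ground-truth gains; on Llama3.1 and OLMo2, the spurious variants fail, consistent with RLVR activating a latent strategy rather than teaching one.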

FinCoT extends this principle from training time to inference time. By embedding expert-derived reasoning blueprints (as Mermaid diagrams) within structured CoT prompts for financial reasoning, FinCoT improves accuracy from 63.2% to 80.5% while cutting generated tokens by roughly a factor of eight relative to unstructured CoT. The format-over-content principle holds bidirectionally: both training data format and prompt format shape reasoning strategy more than domain content. Domain-specific expert structure in the prompt acts as a format intervention, producing structured reasoning traces that align with expert practice. This connects format effects to domain specialization without requiring domain-specific training.
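As an illustration, such a prompt might look like the following; the blueprint content and helper are hypothetical, not FinCoT's actual template:

```python
# Illustrative expert blueprint as a Mermaid flowchart (not FinCoT's own).
BLUEPRINT = """\
flowchart TD
    A[Identify the financial question type] --> B[Extract given quantities and rates]
    B --> C[Select the matching valuation formula]
    C --> D[Compute step by step]
    D --> E[Sanity-check units and magnitude]
"""

def structured_cot_prompt(question: str) -> str:
    """Prepend an expert reasoning blueprint (Mermaid) to shape the trace."""
    return (
        "Follow this expert reasoning procedure exactly:\n"
        '<blueprint format="mermaid">\n'
        f"{BLUEPRINT}"
        "</blueprint>\n\n"
        f"Question: {question}\n"
        "Work through each blueprint step in order, then state the final answer."
    )
```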

The same principle operates at a finer scale within Long CoT: models trained on Long CoT demonstrations where 50% of the numbers are randomly replaced achieve only 3.2% lower accuracy than those trained on correct samples, whereas shuffling 67% of the reasoning steps causes a 13.3% accuracy drop. What distillation transfers is the structural architecture of reasoning (reflection, backtracking, self-validation sequences), not the specific content of individual steps. Format dominance extends inward: not only does the training format determine the strategy, but, within a format, the structural template matters more than factual content. See 'What do models actually learn from chain-of-thought training?'.
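A minimal sketch of the two perturbations described above (the cited work's exact procedures may differ):

```python
import random
import re

def corrupt_numbers(trace: str, p: float = 0.5) -> str:
    """Replace each number in a CoT trace with a random one, with probability p."""
    def repl(m: re.Match) -> str:
        return str(random.randint(0, 999)) if random.random() < p else m.group(0)
    return re.sub(r"\d+", repl, trace)

def shuffle_steps(steps: list[str], frac: float = 0.67) -> list[str]:
    """Shuffle a random subset (about `frac` of the total) of reasoning steps."""
    idx = random.sample(range(len(steps)), k=round(frac * len(steps)))
    vals = [steps[i] for i in idx]
    random.shuffle(vals)
    out = steps[:]
    for i, v in zip(idx, vals):
        out[i] = v
    return out
```

The asymmetry (number corruption barely hurts, step shuffling does) is the structure-over-content signature in miniature: the template survives corruption of the facts but not of the sequence.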


Source: Reasoning Methods CoT ToT, RLVR
